Background¶

English-language folk songs have a long tradition and have changed over time. Songs are not easily identifiable by name alone, and lyrics often have variations. Steve Roud began indexing his own collection in the 1970s, and his Roud Index has become the standard for grouping together different versions of the same song. He is still indexing as of 2023.

Could a machine learning algorithm hope to match his skill? Given the lyrics, would it choose the same groupings of songs, where the line between "same" and "different" is fuzzy? Could it help with future indexing?

Data extraction¶

Source overview¶

Although the Roud Index is a lyrics-based classification system (rather than tune-based), the officially hosted index at vwml.org does not include lyric transcriptions as a standard data field. Some lyrics are accessible online as scanned images of historical collections, others are on linked external sites, and others are not available at all.

So the first challenge is to get a dataset with enough full lyrics and Roud numbers in combination. The main contenders for the source of this data are Mudcat and The Traditional Ballad Index, both well-established online song databases.

Mudcat¶

  • Project focuses on song lyrics and tunes, but also contains Roud numbers for approximately 300 songs.
  • Data and formats:
    • Digitrad (DT) download: askSam MS-DOS database (last updated in 2002)
    • Song web pages
    • Forum posts containing songs

The Traditional Ballad Index¶

  • Project focuses on cataloguing*, but also has supplementary lyrics for approximately 1110 songs.
  • Data and formats:
    • The Ballad Index Software download: Claris Filemaker database
    • Song web pages (without lyrics)
    • The Ballad Index (BI) and The Supplemental Tradition (ST) (lyrics) as HTML or TXT lists

* This is a similar approach to Roud's, but it focuses on the basic unit of a song rather than its individual instances (e.g. variations, songbook entries or performances), and therefore uses song titles as its main identifiers, with keywords and first lines for disambiguation.

Extraction process¶

Neither of the downloadable databases, the Ballad Index (which would have included ST lyrics) or Mudcat's Digitrad, could be opened.

In order to link Roud numbers to lyrics, I therefore need to work with the .txt version of the Ballad Index (which does not include ST) as my base for a new database, extract the records from it, then join ST and DT's lyrics to these records using the various references provided in each data source.

Linking data: Filenames as keys¶

To link the lyrics correctly to the main data of the BI, I need fields that act as identifiers/keys:

BI filename¶

Alphanumeric filename serving as an identifier for all BI records, also referenced by ST lyrics where they exist.

DT filename¶

8.3 filename (all-caps without extension) serving as an identifier for all DT records, also sometimes referenced in BI.

  • Note: in a minority of cases, modified DT filenames also appear to be used as the main BI filenames ('DT' + first six characters in lower or title case), e.g. 'DToatsbe' is the same as 'OATSBEAN' in DT. However, this occasionally disagrees with the stated DT filename for the BI record.

Other numbers and references:¶

DT number: Many records in DT and BI also contain a 'DT #'. This number is not the same as the DT filename and, contrary to my first assumption, it does not correspond to the SongID in Mudcat URLs either (e.g. http://mudcat.org/@displaysong.cfm?SongID=329). It appears to be another grouping system developed by Mudcat, intended to extend Child numbers (see below): "Francis J. Child only went up to 305--since there are ballads he didn't include, you may notice some numbers like DT #510 . Not to worry--it just helps locate variants".

Roud number: Found in BI only (at least as far as downloadable data is concerned - song lyrics on Mudcat's website do often include this).

Child number: The Child Ballads were the first large collection of songs of English and Scottish origin, collected by Francis James Child in the 1800s. Many of the songs appear in multiple versions. Child numbers (1-305) are often referenced in folk song sources.

Laws number: George Malcolm Laws and the American Folklore Society published a collection of traditional songs in 1957. Laws numbers contain an initial letter which indicates the song's theme, e.g. 'M: Ballads of Family Opposition to Lovers'. Laws numbers are also commonly referenced.

Other collections: References to other collections are sometimes found, and some of these also have their own numbers for songs.

Extraction quantity targets (BI, ST, DT)¶

Based on searches in a text editor, I estimate that I can extract approximately the following data [with comparisons for a Google domain search of the online versions]:

  • BI: 30445 song record files, of which (in combination):
    • 14213 are stubs for variants that only refer to other songs
    • 2623 refer to DT files (lyrics) [compare: Google search: 357]; 356 have BI filenames referring to a DT filename
    • 1180 refer to ST files (lyrics) [compare: Google search: 395]
    • 12126 of these contain Roud index numbers [compare: Google search: 2700]
  • ST: 1229 lyrics referencing 1136 BI files [no separate online version]
  • DT: 8932 song record files (lyrics)
    • only 1 contains a Roud number [compare: Google search of newer web version: 435]

BI (Ballad Index)¶

Below is a preview of balldidx.txt. The text version of the Ballad Index file is tricky to work with as entries are presented as a list with inconsistent headings and mixed data.

I first used a text editor to place colons before Roud numbers and DT filenames, so that they could be matched more easily. (This could perhaps have been done better with regex, but because the entries were formatted inconsistently I decided to save myself a step to begin with.)

It is interesting to note that the BI database also references Mudcat's DT filenames, for example 'DT, MASS1913*' in the '1913 Massacre' record below. This means we can also supplement lyrics by cross-referencing this data.

===
VERSION 6.5, February 26, 2023
===
NAME: 10,000 Years Ago: see I Was Born About Ten Thousand Years Ago (Bragging Song) (File: R410)
===
NAME: 10th MTB Flotilla Song: see Fred Karno's Army (File: NeFrKaAr)
===
NAME: 13 Highway
DESCRIPTION: "I went down 13 highway, Down in my baby's door Raining and storming, Scarcely see the road." "Clouds dark as night, If my baby don't fail me I'll make every thing all right" "Going 60 miles an hour..." "Don't the highway look lonesome..."
AUTHOR: unknown
EARLIEST_DATE: 1938 (recording, Walter Davis)
KEYWORDS: grief love promise nonballad lover technology
FOUND_IN: US(SE)
REFERENCES: (0 citations)
Roud #29487
RECORDINGS:
Walter Davis, "13 Highway" (Bluebird B7693, 1938)
Moses Williams, "13 Highway" (on USFlorida01)
NOTES: Moses Williams sings "I always wonder why ... That woman didn't treat me right." The description follows the Walter Davis recording. - BS
Last updated in version 5.0
File: Rc13Hwy
===
NAME: 151 Days: see Hundred and Fifty-One Days (File: Colq060)
===
NAME: 1861 Anti Confederation Song, An: see Anti-Confederation Song (File: FJ028)
===
NAME: 1913 Massacre
DESCRIPTION: In Calumet, Michigan, striking copper miners and their children are having a Christmas celebration; strike-breakers outside bar the doors then raise a false fire alarm. In the ensuing stampede, seventy-three children are crushed or suffocated
AUTHOR: Woody Guthrie
EARLIEST_DATE: 1945 (recording by author)
KEYWORDS: lie strike death labor-movement mining disaster children
FOUND_IN: US
REFERENCES: (3 citations)
Greenway-AmericanFolksongsOfProtest, pp. 157-158, "1913 Massacre"
Silber/Silber-FolksingersWordbook, p. 306, "The 1913 Massacre" (1 text)
DT, MASS1913*
Roud #17663
RECORDINGS:
Woody Guthrie, "1913 Massacre" (Asch 360, 1945; on Struggle1, Struggle2)
CROSS_REFERENCES:
cf. "One Morning in May (To Hear the Nightingale Sing)" (tune)
NOTES: In the late 19th/early 20th century, the rapid expansion of the electrical industry created great demand for copper, for which the chief source was the mines in the upper peninsula of Michigan. Bitter strikes resulted as the miners, under the leadership of the Western Federation of Miners, demanded decent pay and safer working conditions.
Guthrie's description of the events of 1913 is dead-on accurate, according to the residents of Calumet; Italian Hall, where the disaster occurred, was still standing in the early 1980s, but has since been torn down. - PJS
There is an historical marker on the site (Italian Hall, 7th Street, Calumet, MI, at its junction with Elm Street, one block south of Highway 203), and the site has not been built over. One of the plaques has a picture of Woody and mentions this song. There are quite a few recent photos of the site on Google Maps. - RBW
Last updated in version 6.1
File: FSWB306A

I then used a script with regular expressions to import while doing the following:

  • split song records at the marker '==='
  • extract only the values for 'name', 'description', 'earliest_date', 'found_in', 'keywords', 'cross_references', 'roud', 'bi_file', 'st_file', and 'dt_file'
  • split and store the referenced song name and filename in one-line stub records, which only serve to reference a main song
  • extract only the earliest year found in the 'EARLIEST_DATE:' field, which contains mixed data
  • replace empty fields with NumPy NaN to allow for better data manipulation

These are stored in df_bi.
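
The import steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the script actually used: the sample text is a shortened stand-in for the index file, it matches the unmodified 'Roud #' format rather than the hand-edited one, and the names STUB_RE, FIELD_RES, parse_record and earliest_year are hypothetical.

```python
import re

import numpy as np
import pandas as pd

# Shortened stand-in for the Ballad Index text file (not the full data).
SAMPLE = """===
VERSION 6.5, February 26, 2023
===
NAME: 10,000 Years Ago: see I Was Born About Ten Thousand Years Ago (Bragging Song) (File: R410)
===
NAME: 13 Highway
DESCRIPTION: "I went down 13 highway, Down in my baby's door..."
AUTHOR: unknown
EARLIEST_DATE: 1938 (recording, Walter Davis)
KEYWORDS: grief love promise nonballad lover technology
FOUND_IN: US(SE)
Roud #29487
File: Rc13Hwy
"""

# One-line stub: "NAME: <variant title>: see <main title> (File: <bi_file>)"
STUB_RE = re.compile(
    r"^NAME: (?P<name>.+?): see (?P<key_name>.+) \(File: (?P<bi_file>[^)]+)\)$"
)

# One pattern per field of a full record (assumed layout, based on the preview).
FIELD_RES = {
    "name": re.compile(r"^NAME: (.+)$", re.M),
    "description": re.compile(r"^DESCRIPTION: (.+)$", re.M),
    "earliest_date": re.compile(r"^EARLIEST_DATE: (.+)$", re.M),
    "keywords": re.compile(r"^KEYWORDS: (.+)$", re.M),
    "found_in": re.compile(r"^FOUND_IN: (.+)$", re.M),
    "roud": re.compile(r"^Roud #(\S+)", re.M),
    "bi_file": re.compile(r"^File: (\S+)", re.M),
}


def parse_record(block):
    """Turn one '==='-delimited block into a dict of fields."""
    block = block.strip()
    stub = STUB_RE.match(block)
    if stub:
        return stub.groupdict()
    record = {}
    for field, pattern in FIELD_RES.items():
        found = pattern.search(block)
        record[field] = found.group(1).strip() if found else np.nan
    return record


def earliest_year(value):
    """Return the smallest four-digit number in a mixed date field."""
    if not isinstance(value, str):
        return np.nan
    years = [int(y) for y in re.findall(r"\b\d{4}\b", value)]
    return min(years) if years else np.nan


blocks = re.split(r"^===[ \t]*$", SAMPLE, flags=re.M)
records = [parse_record(b) for b in blocks if b.strip().startswith("NAME:")]
df_bi = pd.DataFrame(records)
df_bi["earliest_year"] = df_bi["earliest_date"].apply(earliest_year)
```

Building the frame from a list of dicts means that fields missing from a record (for example, everything but the title and filename in a stub) come out as NaN without extra handling.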

Target: 30445 file records | Output: 30418 file records

Out[1]:
name key_name keywords description long_description found_in bi_file st_file dt_file roud
0 10,000 Years Ago I Was Born About Ten Thousand Years Ago (Bragg... NaN NaN NaN NaN R410 NaN NaN NaN
1 10th MTB Flotilla Song Fred Karno's Army NaN NaN NaN NaN NeFrKaAr NaN NaN NaN
2 13 Highway NaN grief love promise nonballad lover technology "I went down 13 highway, Down in my baby's doo... NaN US(SE) Rc13Hwy NaN NaN 29487
3 151 Days Hundred and Fifty-One Days NaN NaN NaN NaN Colq060 NaN NaN NaN
4 1861 Anti Confederation Song, An Anti-Confederation Song NaN NaN NaN NaN FJ028 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ...
30412 Zula NaN love rejection separation travel "Thou lov'st another, Zula, Thou lovest him al... NaN US(So) Brne049 NaN NaN 11330
30413 Zulu Warrior, The NaN nonballad nonsense campsong "I-kama zimba zimba zayo I-kama zimba zimba ze... NaN NaN ACFF061A NaN NaN NaN
30414 Zum Gali Gali NaN foreignlanguage campsong Hebrew. "Zum, gali-gali-gali, Zum gali-gali, Z... NaN NaN ACSF314Z NaN NaN NaN
30415 Zutula Dead NaN death poison food A nice girl gave Zutula bitter casava to eat a... NaN West Indies(Trinidad) RcALZuDe NaN NaN NaN
30416 Zwei Soldaten, Die NaN foreignlanguage soldier food homicide suicide ... German. "Es war einmal zwei Bauersohn, Die hat... NaN US(MW) RDL056 NaN NaN NaN

30417 rows × 10 columns

Stub inheritance¶

Next I want to make stubs inherit Roud number and file references from their parent entries. I do this via a lookup table containing only those 'bi_file' entries that have the other data associated:

Out[2]:
bi_file st_file dt_file roud
2 Rc13Hwy NaN NaN 29487
5 FSWB306A NaN MASS1913* 17663
8 Hopk112 NaN NaN 29405
11 Hopk039 NaN NaN 29404
12 Hopk046 NaN NaN 29403
... ... ... ... ...
30399 San449 San449 (Full) NaN 12174
30403 SuSm091B NaN NaN 20694
30405 Dett196 NaN NaN 15233
30406 Fus214 Fus214 (Partial) NaN 16373
30412 Brne049 NaN NaN 11330

12420 rows × 4 columns
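
The lookup-and-inherit step could be sketched like this. It assumes that a stub shares its 'bi_file' with its parent record (as the '(File: ...)' suffix on stub lines suggests); the second row of the toy frame is a made-up stub, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: one full record and one invented stub pointing at the same file.
df_bi = pd.DataFrame({
    "name": ["1913 Massacre", "Hypothetical Variant Title"],
    "key_name": [np.nan, "1913 Massacre"],
    "bi_file": ["FSWB306A", "FSWB306A"],
    "st_file": [np.nan, np.nan],
    "dt_file": ["MASS1913*", np.nan],
    "roud": ["17663", np.nan],
})
inherit_cols = ["st_file", "dt_file", "roud"]

# Lookup table: one row per bi_file that carries at least one inherited value.
lookup = (
    df_bi.dropna(subset=inherit_cols, how="all")
    .drop_duplicates("bi_file")
    .set_index("bi_file")[inherit_cols]
)

# Fill only the gaps, so values already present on a row are never overwritten.
for col in inherit_cols:
    df_bi[col] = df_bi[col].fillna(df_bi["bi_file"].map(lookup[col]))
```

Using fillna rather than a plain assignment keeps any value a stub already has, which matters if a stub and its parent ever disagree.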

Out[4]:
name key_name keywords description long_description found_in bi_file st_file dt_file roud
0 10,000 Years Ago I Was Born About Ten Thousand Years Ago (Bragg... NaN NaN NaN NaN R410 NaN NaN NaN
1 10th MTB Flotilla Song Fred Karno's Army NaN NaN NaN NaN NeFrKaAr NaN NaN NaN
2 13 Highway NaN grief love promise nonballad lover technology "I went down 13 highway, Down in my baby's doo... NaN US(SE) Rc13Hwy NaN NaN 29487
3 151 Days Hundred and Fifty-One Days NaN NaN NaN NaN Colq060 NaN NaN NaN
4 1861 Anti Confederation Song, An Anti-Confederation Song NaN NaN NaN NaN FJ028 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ...
30412 Zula NaN love rejection separation travel "Thou lov'st another, Zula, Thou lovest him al... NaN US(So) Brne049 NaN NaN 11330
30413 Zulu Warrior, The NaN nonballad nonsense campsong "I-kama zimba zimba zayo I-kama zimba zimba ze... NaN NaN ACFF061A NaN NaN NaN
30414 Zum Gali Gali NaN foreignlanguage campsong Hebrew. "Zum, gali-gali-gali, Zum gali-gali, Z... NaN NaN ACSF314Z NaN NaN NaN
30415 Zutula Dead NaN death poison food A nice girl gave Zutula bitter casava to eat a... NaN West Indies(Trinidad) RcALZuDe NaN NaN NaN
30416 Zwei Soldaten, Die NaN foreignlanguage soldier food homicide suicide ... German. "Es war einmal zwei Bauersohn, Die hat... NaN US(MW) RDL056 NaN NaN NaN

30417 rows × 10 columns

Clean and split multiple 'dt_file' entries¶

Next I need to handle cases where more than one DT filename is associated with each record, to allow for correct data merging later. I will assign the duplicates to new rows, first discarding DT numbers and other characters that do not constitute a valid DT filename.
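
As a rough sketch of the cleaning rule, the function below keeps only tokens that look like DT filenames (2 to 8 upper-case letters or digits, with at least one letter) and drops DT numbers and trailing asterisks. The exact rule used in the notebook is not shown, so the pattern, the function name clean_dt_field and the second test string are assumptions for illustration.

```python
import re

# Upper-case alphanumeric token of 2-8 characters containing at least one letter.
DT_NAME_RE = re.compile(r"\b(?=[A-Z0-9]*[A-Z])[A-Z0-9]{2,8}\b")


def clean_dt_field(raw):
    """Return the DT-style filenames found in a raw BI reference string."""
    if not isinstance(raw, str):
        return []
    raw = re.sub(r"^\s*DT\b", "", raw)  # drop the leading 'DT' source label
    return DT_NAME_RE.findall(raw)


print(clean_dt_field("DT, MASS1913*"))
print(clean_dt_field("DT 200, GYPDAVY GYPLADD*"))  # invented reference format
```

Requiring at least one letter is what separates a filename such as 'JULY12' from a bare DT number.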

A visual check suggests the cleaning worked:

Out[6]:
name bi_file dt_file
13779 Johnny Fill Up the Bowl (In Eighteen Hundred a... R227 ABEWASH FORBALES
176 Admiral Benbow (I) PBB076 ADBENBOW ADBENBW2
15110 Let Me In This Ae Nicht DTaenich AENICHT COLDRAIN
198 After the Ball SRW169 AFTRBALL UNFORTU6
223 Aiken Drum OO2007 AIKDRUM AIKDRUM3
... ... ... ...
28425 Weaver and the Factory Maid, The DTwvfact WVFACTGL WEAVFACT
3176 Brisk Young Butcher, The DTxmasgo XMASGOOS XMASGOO2
22203 Rare Willie Drowned in Yarrow, or, The Water o... C215 YARROW2 YARROW3
30179 Young Allan [Child 245] C245 YNGALAN YNGALAN2
23899 Seventeen Come Sunday [Laws O17] LO17 YONHIGH ROCKYMT TROOPRM2

450 rows × 3 columns

Now to split the valid filenames into their own rows and examine the changed rows:

Out[8]:
name bi_file dt_file
85 Abdul the Bulbul Emir (II) EM210 ABDULBL2
84 Abdul the Bulbul Emir (I) LxA341 ABDULBUL
48 A-Begging I Will Go K217 ABEGGIN
14132 Johnny Fill Up the Bowl (In Eighteen Hundred a... R227 ABEWASH
5795 David's Lamentation FSWB412B ABSALON
... ... ... ...
6848 Drummer Boy of Waterloo, The [Laws J1] LJ01 YOUNGED
14832 Kingdom Coming (The Year of Jubilo) R230 YRJUBILO
31061 Zack, the Mormon Engineer BRaF444 ZACKMORM
31067 Zebra Dun, The [Laws B16] LB16 ZEBRADUN
31064 Zeb Tourney's Girl [Laws E18] LE18 ZEBTURNY

2489 rows × 3 columns
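
One way to do this split without growing a DataFrame row by row is pandas' explode. The sketch below assumes that multiple filenames in 'dt_file' are separated by whitespace, as in the cleaned output shown earlier; df and df_split are hypothetical names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Admiral Benbow (I)", "13 Highway"],
    "bi_file": ["PBB076", "Rc13Hwy"],
    "dt_file": ["ADBENBOW ADBENBW2", np.nan],
})

# Turn each cell into a list of filenames, then give every filename its own row.
# Rows whose dt_file is NaN pass through unchanged.
df_split = df.assign(dt_file=df["dt_file"].str.split()).explode(
    "dt_file", ignore_index=True
)
```

Every other column is repeated on the new rows, so each filename keeps its 'bi_file' for the later merge.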

Extract DT filenames from BI filenames¶

The next stage is to extract possible DT filenames from the bi_file entries, which I will match against the DT data later. Of the entries whose BI filenames follow this 'DTxxxxxx' pattern, 168 are missing DT filenames:

Out[9]:
name bi_file dt_file
432 Allan Water DTalanwa NaN
493 Altoona Freight Wreck, The DTwrck12 NaN
1133 B'y' Sara Burned Down DTBayous NaN
1188 Back and Side Go Bare, Go Bare! DTbcksid NaN
1387 Banks of Allen Water, The DTalanwa NaN
... ... ... ...
30282 Winter It Is Past, The DTcurrki NaN
30290 Winter's Gone and Past DTcurrki NaN
30307 Wise Willie DTcutywr NaN
30455 Wreck at Latona, The DTwrck12 NaN
31046 Your Grannie and Your Other Grannie DTgranbu NaN

168 rows × 3 columns

I will strip 'DT' from these filenames and insert them into the column 'dt_file' for affected rows:

Out[10]:
name bi_file dt_file
432 Allan Water DTalanwa alanwa
493 Altoona Freight Wreck, The DTwrck12 wrck12
1133 B'y' Sara Burned Down DTBayous Bayous
1188 Back and Side Go Bare, Go Bare! DTbcksid bcksid
1387 Banks of Allen Water, The DTalanwa alanwa
... ... ... ...
30282 Winter It Is Past, The DTcurrki currki
30290 Winter's Gone and Past DTcurrki currki
30307 Wise Willie DTcutywr cutywr
30455 Wreck at Latona, The DTwrck12 wrck12
31046 Your Grannie and Your Other Grannie DTgranbu granbu

168 rows × 3 columns
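
A minimal sketch of this step, assuming that any 'bi_file' beginning with 'DT' follows the 'DTxxxxxx' pattern and that only rows without an existing 'dt_file' should be filled:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bi_file": ["DTalanwa", "DTaenich", "R227"],
    "dt_file": [np.nan, "AENICHT", "ABEWASH"],
})

# Capture whatever follows the 'DT' prefix; other filenames yield NaN.
candidate = df["bi_file"].str.extract(r"^DT(\w+)$", expand=False)

# Only fill rows that have no DT filename yet.
df["dt_file"] = df["dt_file"].fillna(candidate)
```

The extracted fragments keep their original case, so they still need to be matched against the real upper-case DT filenames later.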

Target check:¶

DT file references:

Target: 2623 | Initial output: 2605 | Post-split: 3264

Out[11]:
3432

ST file references:

Target: 1180 | Initial output: 1166 | Post-split: 1200

Out[12]:
1200

Roud references:

Target: 12126 | Initial output: 12004 | Post-split: 12656

Out[13]:
12656

Unique roud numbers: 11266 (note that this is inaccurate as multiple Roud numbers per field are sometimes still present)

Out[14]:
11266

The following query returns 3784 rows, i.e. records that carry both a Roud number and a lyrics reference. This is roughly how many song texts I would have if I joined the data now and every referenced lyrics file could be extracted.

Out[15]:
name roud bi_file st_file dt_file
10003 Gypsy Laddie, The [Child 200] 1 C200 NaN GYPLADD3
10000 Gypsy Laddie, The [Child 200] 1 C200 NaN GYPDAVY
10001 Gypsy Laddie, The [Child 200] 1 C200 NaN GYPLADD
10002 Gypsy Laddie, The [Child 200] 1 C200 NaN GYPLADD2
10011 Gypsy Laddie, The [Child 200] 1 C200 NaN GYPLADY
... ... ... ... ... ...
16274 Lord Cornwallis's Surrender V50597 SBoA088 NaN LRDCRNWL
17588 Memory of the Dead, The V5143 PGa039 NaN MEMRYDED
25860 Star-Spangled Banner, The V5200 MKr015 NaN STARSPAN
6797 Drive the Cold Winter Away (In Praise of Chris... V9375 Log293 Log293 (Full) ALLHAIL
6796 Drive the Cold Winter Away (In Praise of Chris... V9375 Log293 Log293 (Full) DRIVCOLD

3784 rows × 5 columns
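
A query of this kind can be written as a boolean mask. The sketch assumes the condition is "has a Roud number and at least one of st_file or dt_file"; the toy rows reuse values from the outputs above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Gypsy Laddie, The [Child 200]", "13 Highway", "Zulu Warrior, The"],
    "roud": ["1", "29487", np.nan],
    "st_file": [np.nan, np.nan, np.nan],
    "dt_file": ["GYPDAVY", np.nan, np.nan],
})

# A record counts as having lyrics if it references an ST or a DT file.
has_lyrics_ref = df["st_file"].notna() | df["dt_file"].notna()
with_roud_and_lyrics = df[df["roud"].notna() & has_lyrics_ref]
```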

This number could increase further if I could also match variant lyrics based on variant titles, or if any new backward references to BI files are found in the two lyrics data sources.

I will copy these test modifications into df_bi and save it to CSV for further use.

ST (Supplemental Tradition of BI)¶

The Supplemental Tradition is the lyrics index of the Ballad Index. Again, I must use regular expressions to extract the data, this time from supptrad.txt, which has a different format from the BI.

The main song title is listed at the head of each record, followed by the type of lyrics [Complete text(s) or Partial text(s)], then the different versions of the lyrics marked [*** A ***, *** B ***, *** C ***, ...], each often preceded by an alternate title and notes about the story and/or provenance of the lyrics.

===
Version 6.5  February 26, 2022
===

A Robin, Jolly Robin
  Complete text(s)

          *** A ***

A Robyn Jolly Robyn

From Percy/Wheatley, I.ii.4, pp. 186-187

"[P]rinted from what appears to be the most ancient of Dr.
Harrington's poetical MSS. and which has, therefore, been marked
[...]

A Robyn,
  Jolly Robyn,
Tell me how thy leman doeth
  And thou shalt know of myn.

'My lady is unkynde perde.'
  Alack! why is she so?
'She loveth an other better than me;
  And yet she will say no.'
[...]

          *** B ***

(No title)

From Shakespeare, "Twelfth Night" Act IV, scene 2. In the scene,
the Clown and Malvolio are talking past each other. The text
below shows the reconstructed lines of the song, with Malvolio's
answers in the margin. Line numbers are in the left margin.

71 'Hey, Robin, jolly Robin,
72    Tell me how thy lady does.'      Malv: Fool.
74 'My lady is unkind, perdie!'        Malv: Fool.
76 'Alas, why is she so?'              Malv: Fool, I say.
78 'She loves another.'  Who calls, ha?

File: Perc1185
===

A, U, Hinny Bird
  Partial text(s)

          *** A ***

From Stokoe/Reay, Songs and Ballads of Northern England, pp. 160-161.

Its O, but aw ken well --
    A, U, hinny burd;
The bonny lass o' Benwell,
    A, U, A.
[...]

Due to the aforementioned song-based classification system of the BI, multiple alternate versions are often linked to one BI record file and key title. Later I may want to split the files into different versions, so I will treat the main song record as a parent ('key_...') and the versions as children, which stand as individual records but inherit some values from their parent. Some of the alternate versions do not have their own names.

I want to extract: key_name, key_full_part, version_in_key, name, provenance [detected to exclude from lyrics], lyrics, bi_file [this belongs to key/parent but I want to name consistently for later data combinations]
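
The parent/child extraction might be sketched as below. The heuristic for telling title, provenance and lyrics apart (the first paragraph starting with 'From ' or 'As ' is the provenance, the paragraph before it is the title) is an assumption made for this example, as are the names VERSION_RE and parse_st_record; the sample is an abridged copy of the first record.

```python
import re

import pandas as pd

SAMPLE = """===

A Robin, Jolly Robin
  Complete text(s)

          *** A ***

A Robyn Jolly Robyn

From Percy/Wheatley, I.ii.4, pp. 186-187

A Robyn,
  Jolly Robyn,

          *** B ***

(No title)

From Shakespeare, "Twelfth Night" Act IV, scene 2.

'Hey, Robin, jolly Robin,

File: Perc1185
===
"""

VERSION_RE = re.compile(r"^[ \t]*\*\*\* ([A-Z]+) \*\*\*[ \t]*$", re.M)


def parse_st_record(block):
    """Return one dict per lyric version found in a '==='-delimited ST record."""
    file_match = re.search(r"^File: (\S+)\s*$", block, re.M)
    if file_match is None:
        return []
    # Splitting on a pattern with one group yields [header, 'A', text, 'B', text, ...]
    header, *rest = VERSION_RE.split(block[: file_match.start()])
    head_lines = [ln.strip() for ln in header.splitlines() if ln.strip()]
    if len(head_lines) < 2:
        return []
    rows = []
    for label, text in zip(rest[0::2], rest[1::2]):
        paras = [p.strip() for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
        prov_idx = next(
            (i for i, p in enumerate(paras) if p.startswith(("From ", "As "))), None
        )
        if prov_idx is None:
            name, provenance, lyric_paras = None, None, paras
        else:
            name = paras[prov_idx - 1] if prov_idx > 0 else None
            provenance = paras[prov_idx]
            lyric_paras = paras[prov_idx + 1:]
        rows.append({
            "key_name": head_lines[0],
            "key_full_part": head_lines[1],
            "bi_file": file_match.group(1),
            "version_in_key": label,
            "provenance": provenance,
            "name": name,
            "lyrics": "\n\n".join(lyric_paras),
        })
    return rows


blocks = re.split(r"^===[ \t]*$", SAMPLE, flags=re.M)
df_st = pd.DataFrame([row for b in blocks for row in parse_st_record(b)])
```

Each version becomes its own row while 'key_name', 'key_full_part' and 'bi_file' are copied down from the parent record.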

Out[18]:
key_name key_full_part bi_file version_in_key provenance name lyrics
0 A Robin, Jolly Robin Complete text(s) Perc1185 A From Percy/Wheatley, I.ii.4, pp. 186-187 A Robyn Jolly Robyn "[F]rom what appears to be the most ancient of...
1 A Robin, Jolly Robin Complete text(s) Perc1185 B From Shakespeare, "Twelfth Night" Act IV, scen... (No title) 71 'Hey, Robin, jolly Robin, 72 Tell me how...
2 A, U, Hinny Bird Partial text(s) StoR160 A From Stokoe/Reay, Songs and Ballads of Norther... NaN A, U, hinny burd; The bonny lass o' Benwell, A...
3 Adieu to Erin (The Emigrant) Complete text(s) SWMS255 A As found in Gale Huntington, Songs the Whaleme... Adieu to Erin Oh, when I breathed a last adieu, To Erin's an...
4 Agincourt Carol, The Complete text(s) MEL51 A From the Bodleian Library (Cambridge), MS. Sel... The Song of Agincourt Deo gracias anglia, Redde pro victoria, 1 Owre...
... ... ... ... ... ... ... ...
1224 Young Strongbow Partial text(s) FlNG210 A From Helen Hartness Flanders, Elizabeth Flande... NaN In olden times there came, A likely youth who ...
1225 Young Waters [Child 94] Complete text(s) C094 A From Percy/Wheatley, II.ii.18, pp. 229-231 NaN one sheet 8vo.", About Yule, quhen the wind bl...
1226 Zeb Tourney's Girl [Laws E18] Complete text(s) LE18 A As recorded by Vernon Dalhart, 1926. Transcrib... NaN Down in the Tennessee mountains, Away from the...
1227 Zek'l Weep Complete text(s) San449 A From Carl Sandburg, The American Songbag, pp. ... NaN 1 Zek'l weep, Zek'l moan, Flesh come a-creepin...
1228 Zion's Sons and Daughters Partial text(s) Fus214 A From Harvey H. Fuson, Ballads of the Kentucky ... NaN See the fountain opened wide, That from sinnin...

1229 rows × 7 columns

Target: 1136 records | Output: 1229 records

DT (Mudcat's Digitrad)¶

The only Mudcat Digitrad file available to download is an askSam 32-bit MS-DOS database, which I was not able to open. I was able to access a database file in the ZIP in which the lyrics were visible in plain text. However, a lack of consistent record delimiters and field labels/delimiters, together with the presence of many (often invisible) Unicode control characters, made extraction challenging and unreliable.

askSamx|	~ 7|*2Ò
PƒÛÍ{Ü͌e
Œ ƒÛÍ`Ó&{forget}{rem}¥ö: ] :SI5ÿÿUse your cursor control keys to light up a search, then press <CR>:ÿ_____________________________________________________________________________ :QUICKÿLISTÿLists filenames and titles of all songs whose filename, firstÿline or keyword list contains the search string you enter.ÿSearch string must be a single word or phrase. :FULL SEARCHÿLists filenames and titles of all songs containing the searchÿstring anywhere within the text. Multiple words OK. :RETURNÿGo back to previous list. You can repeat this until you doÿanother search. :CONTEXTÿÿLists the filename and an in-context view of any other word inÿÿall songs that meet your search specification. ÿ______________________________________________________________________________ÿPress spacebar for more options: The next screens let you save files to disk,ÿÿand have several Help menus.g&:DISKÿLISTÿSaves the list of remembered items to a file called LIST.TXT:DISKÿSHOWÿSaves the lyrics of the files in the remembered list to a fileÿcalled SHOW.TXT2¥ [...] If you type MA while the text of the song is displayed, the"{mes :r15 "lightbar will shrink to cover a single word. Light up any word"{mes :r16 "that interests you, and a press of <Enter> will find all the songs "{mes :r17 "that have that word."{mes :r18 "To print a list of titles (with filenames), quit the program,"{mes :r19 " set up your printer for 12 pitch (elite), type <titles> and have "Ë>{mes :r20 " patience --- there are some 26 or so pages of this stuff. To print"{mes :r21 " a list of the keywords we've used, turn on your printer and type: "{mes :r22p " <KEYWORDS>. There are only a couple of pages of these."s Û:CONTEXT{screen free forget}""ENTER SEARCH SPEC. 
HERE" filename[ {col 15} {opt show :3 "" ENTER KEY WORD "}{rem} {if cou} play.exe*{row-1} {col 55} {"*"}Æ#Ë'ARD TAC1.I'm a shearer, yes I am, and I've shorn 'em sheep and lambFrom the Wimmera to the Darling Downs and back,And I've rung a shed or two where the fleece was tough as glueBut I'll tell you where I struck the 'ardest tac.2.I was down round Yenda way, killing time from day to dayTill the big sheds started moving further outWhen I struck a bloke by chance that I summed up in a glanceAs a cocky from a vineyard round about.3.Now it seems he picked me too;well, it wasn't hard to do'Cause I had some tongs a-hangin' at the hip,"I got a mob,"he said, "A mob about two hundred headAnd I'll give a ten pun note to have the clip."4.I says, "Right, I'll take the stand" - it meant gettin' in me handAnd by nine o'clock we'd rounded up the mobIn a shed sunk in the ground - yeah, with wine casks all aroundí&s And that was where I started on me job.5.I goes easy for a bit while me hand was gettin' fitAnd by dinner-time I'd done some half a scoreWith the cocky pickin' up and handing me a cupOf pinky after every sheep I shore.6.The cocky had to go away about the seventh dayAfter showing me the kind of cask to useThen I'd do the picking up and manipulate the cupStrolling round them wine casks, just to pick and choose.7.Then I'd stagger to the pen, grab a sheep and start againWith a noise between a hiccup and a sobAnd sometimes I'd fall asleep with me arms around the sheepWorn and weary from me over-arduous job.8.And so six weeks went by, until one day with a sighI pushed the dear old cobbler through the doorGathered in the cocky's pay then staggered on me way˜(Æ#From the hardest bloody shed I ever shore.note:"Recorded at the home of Mr. Jack Davies, a pioneer soldier-settlerof the Leeton District, on the Murrumbidgee, N.S.W.Mr. Daviessays he didn't write "Ard Tac", but adds, "I distinctly rememberbeing sober the day it was written." (Lahey). 
Tune heard from MikeEves, Sydney FC, 1971.@Australia @sheep @shearing @drinkfilename[ HARDTACplay.exeÿHARDTACJBoct96…+í&(I'VE GOT) BIGGER FISH TO FRY(Tim Woodson)Sittin' on the bank of that muddy Mississippi,Watchin' that river roll by.Got my cane pole up the air, got my bobber throwed way out there.Gonna catch a big catfish to fry.Sittin' on my bucket, got my cooler by my sideGot a big ol' can of worms and a bottle of home made wine.Ain't had a bite in a while, but Lord that's just fine.Gonna watch that sun go down and drink that home made wine.Cho:Lord it don't get no better than this.Sittin' on that riverbank, gonna catch me a big old fish.Lord if you take me, don't take me tonight.Cause I got a big one on the line.Cane pole hit the water, and I dropped that bottle of wine.Fell off my bucket tryin' to reel in that line.¡,˜(Lord it's a big one, could be the biggest of all time.Don't you know before I pulled him in, that big fish snapped my line?Recorded by Wildhorse CreekLyrics Tim Woodson, Music Tim Woodson, Rob Compton, Pat Stevenson (c) 1995.@fishing @foodfilename[ FISHFRYJDJuly01þ.…+

I extracted the data using regular expressions, after using a text editor to add some line breaks and spaces in place of errant Unicode characters in the source. This resulted in a passable but very dirty dataset; title detection was especially unreliable, and errors there affected the detection of the remaining fields in a record. I then also applied regex to the text file itself to clean it further.

Although the data is now relatively clean, there are still some issues, notably that some lyrics still include notes on the text which were hard to separate from the lyrics themselves due to a lack of consistency.
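
To give an idea of the approach, the sketch below strips control characters and then pulls title, lyrics, keywords and filename out of one record. It assumes a record that has already been pre-cleaned so that the title sits on the first line and the filename is followed by a line break; RAW is an invented, simplified record, and strip_control_chars and parse_dt_record are hypothetical names. The real file needs considerably more special-casing.

```python
import re
import unicodedata

# Invented, pre-cleaned record: real records are far messier than this.
RAW = (
    "(I'VE GOT) BIGGER FISH TO FRY\n"
    "Sittin' on the bank of that muddy Mississippi,\x07\n"
    "Watchin' that river roll by.\n"
    "@fishing @foodfilename[ FISHFRY\nJD"
)


def strip_control_chars(text):
    """Replace control/format characters (except newline and tab) with a space."""
    return "".join(
        ch if ch in "\n\t" or unicodedata.category(ch)[0] != "C" else " "
        for ch in text
    )


def parse_dt_record(record):
    """Split one record into filename, title, lyrics and keyword list."""
    body, _, tail = record.partition("filename[")
    # Keywords are the trailing run of '@word' tokens before 'filename['.
    kw_match = re.search(r"((?:@\w+\s*)+)$", body)
    keywords = re.findall(r"@(\w+)", kw_match.group(1)) if kw_match else []
    text = body[: kw_match.start()] if kw_match else body
    title, _, lyrics = text.strip().partition("\n")
    file_match = re.match(r"\s*([A-Z0-9]{1,8})\b", tail)
    return {
        "dt_file": file_match.group(1) if file_match else None,
        "name": title.strip(),
        "lyrics": "\n".join(line.rstrip() for line in lyrics.splitlines()),
        "keywords": keywords,
    }


record = parse_dt_record(strip_control_chars(RAW))
```

Anchoring on the literal 'filename[' marker is convenient because it appears near the end of the records in the raw dump, directly after the '@' keywords.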

This data is stored in df_dt:

File records:

Target: 8932 | Initial output (minimal cleaning): 8249 | Output post-cleaning: 8726

Out[19]:
dt_file name lyrics keywords
0 HARDTAC 'ARD TAC 1.I'm a shearer, yes I am, and I've shorn 'em ... [Australia, sheep, shearing, drink]
1 FISHFRY (I'VE GOT) BIGGER FISH TO FRY Sittin' on the bank of that muddy Mississippi,... [fishing, food]
2 JULY12 THE 12TH OF JULY Come pledge again your heart and your hand\nOn... [Irish, peace]
3 AVENUE16 16TH AVENUE From the corners of the country, from the citi... [country]
4 MASS1913 THE 1913 MASSACRE Take a trip with me in nineteen thirteen\nTo C... [union, work, death, Xmas]
... ... ... ... ...
8720 ZEBTURNY ZEB TOURNEY'S GIRL Down in the Tennessee mountains,\nFar from the... [feud]
8721 ZEBRADUN ZEBRA DUN We was camped on the plains at the head of the... [cowboy, animal]
8722 ZENGOSPE ZEN GOSPEL SINGING I once was a Baptist and on each Sunday morn\n... [religion]
8723 ZULIKA ZULEIKA Zuleika was fair to see,\nA fair Persian maide... [marriage, infidelity]
8724 ZULUKING THE ZULU KING Oh the Zulu king with the big nose-ring\nFell ... [camp]

8725 rows × 4 columns

BI second pass¶

I will now return to df_bi and replace the partial DT filenames (derived above from some of the BI filenames) with matching items from our newly-loaded DT data:

I will also update our lookup table for future use:

Out[21]:
bi_file st_file dt_file roud
2 Rc13Hwy NaN NaN 29487
5 FSWB306A NaN MASS1913 17663
8 Hopk112 NaN NaN 29405
11 Hopk039 NaN NaN 29404
12 Hopk046 NaN NaN 29403
... ... ... ... ...
31069 San449 San449 Full NaN 12174
31073 SuSm091B NaN NaN 20694
31075 Dett196 NaN NaN 15233
31076 Fus214 Fus214 Partial NaN 16373
31082 Brne049 NaN NaN 11330

13256 rows × 4 columns
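
Replacing a partial filename with its full DT counterpart amounts to a case-insensitive prefix match. In this sketch a partial name is only replaced when exactly one DT filename starts with it; that tie-breaking rule, and the function names, are assumptions rather than the notebook's actual logic.

```python
import pandas as pd

# A few DT filenames taken from the outputs above.
dt_files = pd.Series(["OATSBEAN", "GYPLADD", "GYPLADD2", "HARDTAC"])


def match_partial(partial, dt_files):
    """Return all DT filenames that start with `partial`, ignoring case."""
    if not isinstance(partial, str):
        return []
    mask = dt_files.str.upper().str.startswith(partial.upper(), na=False)
    return dt_files[mask].tolist()


def resolve(partial, dt_files):
    """Swap in the full DT filename only when the prefix match is unambiguous."""
    matches = match_partial(partial, dt_files)
    return matches[0] if len(matches) == 1 else partial


print(resolve("oatsbe", dt_files))
print(resolve("gyplad", dt_files))  # ambiguous prefix, left unchanged
```

Leaving ambiguous or unmatched fragments untouched keeps them visible for manual review instead of silently linking the wrong lyrics.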

Combine¶

Columns overview¶

Viewing the column names gives me an overview of what to match:

Columns in df_bi: (31087 rows)
name key_name keywords description long_description found_in bi_file st_file dt_file roud
Columns in df_file_lookup: (13256 rows)
bi_file st_file dt_file roud
Columns in df_st: (1229 rows)
key_name key_full_part bi_file version_in_key provenance name lyrics
Columns in df_dt: (8725 rows)
dt_file name lyrics keywords

ST lyrics: merge with additional data from BI¶

Out[23]:
key_name key_full_part bi_file version_in_key provenance name lyrics st_file dt_file roud
0 A Robin, Jolly Robin Complete text(s) Perc1185 A From Percy/Wheatley, I.ii.4, pp. 186-187 A Robyn Jolly Robyn "[F]rom what appears to be the most ancient of... Perc1185 Full HEYROBIN NaN
1 A Robin, Jolly Robin Complete text(s) Perc1185 B From Shakespeare, "Twelfth Night" Act IV, scen... (No title) 71 'Hey, Robin, jolly Robin, 72 Tell me how... Perc1185 Full HEYROBIN NaN
2 A, U, Hinny Bird Partial text(s) StoR160 A From Stokoe/Reay, Songs and Ballads of Norther... NaN A, U, hinny burd; The bonny lass o' Benwell, A... StoR160 Partial NaN 235
3 Adieu to Erin (The Emigrant) Complete text(s) SWMS255 A As found in Gale Huntington, Songs the Whaleme... Adieu to Erin Oh, when I breathed a last adieu, To Erin's an... SWMS255 Full NaN 2068
4 Agincourt Carol, The Complete text(s) MEL51 A From the Bodleian Library (Cambridge), MS. Sel... The Song of Agincourt Deo gracias anglia, Redde pro victoria, 1 Owre... MEL51 Full AGINCRT1 V29347
... ... ... ... ... ... ... ... ... ... ...
1307 Young Strongbow Partial text(s) FlNG210 A From Helen Hartness Flanders, Elizabeth Flande... NaN In olden times there came, A likely youth who ... FlNG210 Partial NaN 4669
1308 Young Waters [Child 94] Complete text(s) C094 A From Percy/Wheatley, II.ii.18, pp. 229-231 NaN one sheet 8vo.", About Yule, quhen the wind bl... C094 Full NaN 2860
1309 Zeb Tourney's Girl [Laws E18] Complete text(s) LE18 A As recorded by Vernon Dalhart, 1926. Transcrib... NaN Down in the Tennessee mountains, Away from the... LE18 Full ZEBTURNY 2249
1310 Zek'l Weep Complete text(s) San449 A From Carl Sandburg, The American Songbag, pp. ... NaN 1 Zek'l weep, Zek'l moan, Flesh come a-creepin... San449 Full NaN 12174
1311 Zion's Sons and Daughters Partial text(s) Fus214 A From Harvey H. Fuson, Ballads of the Kentucky ... NaN See the fountain opened wide, That from sinnin... Fus214 Partial NaN 16373

1312 rows × 10 columns

Then I will check for any backward references from ST that were not found in BI. An empty result means that none were missed.

Out[24]:
key_name key_full_part bi_file version_in_key provenance name lyrics st_file dt_file roud
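
A check like this can be sketched as an anti-join. The snippet below is only an illustration under assumptions: the frames `df_bi` and `df_st` are small stand-ins built from a few rows shown above, and the use of `indicator=True` is one possible approach rather than necessarily what the notebook does.

```python
import pandas as pd

# Toy stand-ins for the BI and ST tables (a handful of rows only).
df_bi = pd.DataFrame({
    "bi_file": ["Perc1185", "StoR160", "LE18"],
    "st_file": ["Perc1185 Full", "StoR160 Partial", "LE18 Full"],
    "roud": [None, "235", "2249"],
})
df_st = pd.DataFrame({
    "key_name": ["A Robin, Jolly Robin", "A, U, Hinny Bird", "Zeb Tourney's Girl"],
    "bi_file": ["Perc1185", "StoR160", "LE18"],
})

# A left merge keeps every ST row; indicator=True adds a "_merge" column
# recording whether each row found a partner in BI.
merged = df_st.merge(df_bi, on="bi_file", how="left", indicator=True)

# ST rows with no counterpart in BI are marked "left_only".
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched))
```

An empty `unmatched` frame corresponds to the empty output above.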

DT lyrics: merge with additional data from BI¶

Out[25]:
dt_file name lyrics keywords bi_file st_file roud
0 HARDTAC 'ARD TAC 1.I'm a shearer, yes I am, and I've shorn 'em ... [Australia, sheep, shearing, drink] NaN NaN NaN
1 FISHFRY (I'VE GOT) BIGGER FISH TO FRY Sittin' on the bank of that muddy Mississippi,... [fishing, food] NaN NaN NaN
2 JULY12 THE 12TH OF JULY Come pledge again your heart and your hand\nOn... [Irish, peace] NaN NaN NaN
3 AVENUE16 16TH AVENUE From the corners of the country, from the citi... [country] NaN NaN NaN
4 MASS1913 THE 1913 MASSACRE Take a trip with me in nineteen thirteen\nTo C... [union, work, death, Xmas] FSWB306A NaN 17663
... ... ... ... ... ... ... ...
8796 ZEBTURNY ZEB TOURNEY'S GIRL Down in the Tennessee mountains,\nFar from the... [feud] LE18 LE18 Full 2249
8797 ZEBRADUN ZEBRA DUN We was camped on the plains at the head of the... [cowboy, animal] LB16 NaN 3237
8798 ZENGOSPE ZEN GOSPEL SINGING I once was a Baptist and on each Sunday morn\n... [religion] NaN NaN NaN
8799 ZULIKA ZULEIKA Zuleika was fair to see,\nA fair Persian maide... [marriage, infidelity] NaN NaN NaN
8800 ZULUKING THE ZULU KING Oh the Zulu king with the big nose-ring\nFell ... [camp] NaN NaN NaN

8801 rows × 7 columns

Then I will check for any backward references to DT that were missed by the ST/BI combination above.

Out[26]:
dt_file name lyrics keywords bi_file st_file_x roud_x st_file_y roud_y

Merge all lyrics¶

Here is the initial merge of all files with lyrics. The data is still inconsistent, but more can be extracted from it.

Out[27]:
key_name key_full_part bi_file version_in_key provenance name lyrics st_file dt_file roud keywords
0 A Robin, Jolly Robin Complete text(s) Perc1185 A From Percy/Wheatley, I.ii.4, pp. 186-187 A Robyn Jolly Robyn "[F]rom what appears to be the most ancient of... Perc1185 Full HEYROBIN NaN NaN
1 A Robin, Jolly Robin Complete text(s) Perc1185 B From Shakespeare, "Twelfth Night" Act IV, scen... (No title) 71 'Hey, Robin, jolly Robin, 72 Tell me how... Perc1185 Full HEYROBIN NaN NaN
2 A, U, Hinny Bird Partial text(s) StoR160 A From Stokoe/Reay, Songs and Ballads of Norther... NaN A, U, hinny burd; The bonny lass o' Benwell, A... StoR160 Partial NaN 235 NaN
3 Adieu to Erin (The Emigrant) Complete text(s) SWMS255 A As found in Gale Huntington, Songs the Whaleme... Adieu to Erin Oh, when I breathed a last adieu, To Erin's an... SWMS255 Full NaN 2068 NaN
4 Agincourt Carol, The Complete text(s) MEL51 A From the Bodleian Library (Cambridge), MS. Sel... The Song of Agincourt Deo gracias anglia, Redde pro victoria, 1 Owre... MEL51 Full AGINCRT1 V29347 NaN
... ... ... ... ... ... ... ... ... ... ... ...
10108 NaN NaN LE18 NaN NaN ZEB TOURNEY'S GIRL Down in the Tennessee mountains,\nFar from the... LE18 Full ZEBTURNY 2249 [feud]
10109 NaN NaN LB16 NaN NaN ZEBRA DUN We was camped on the plains at the head of the... NaN ZEBRADUN 3237 [cowboy, animal]
10110 NaN NaN NaN NaN NaN ZEN GOSPEL SINGING I once was a Baptist and on each Sunday morn\n... NaN ZENGOSPE NaN [religion]
10111 NaN NaN NaN NaN NaN ZULEIKA Zuleika was fair to see,\nA fair Persian maide... NaN ZULIKA NaN [marriage, infidelity]
10112 NaN NaN NaN NaN NaN THE ZULU KING Oh the Zulu king with the big nose-ring\nFell ... NaN ZULUKING NaN [camp]

10113 rows × 11 columns

3913 lyrics have Roud numbers:

Out[28]:
key_name name version_in_key bi_file st_file roud lyrics
2 A, U, Hinny Bird NaN A StoR160 StoR160 Partial 235 A, U, hinny burd; The bonny lass o' Benwell, A...
3 Adieu to Erin (The Emigrant) Adieu to Erin A SWMS255 SWMS255 Full 2068 Oh, when I breathed a last adieu, To Erin's an...
4 Agincourt Carol, The The Song of Agincourt A MEL51 MEL51 Full V29347 Deo gracias anglia, Redde pro victoria, 1 Owre...
5 All Is Well NaN A FlBr078 FlBr078 Partial 5455 Oh, what is this that steals upon my frame? Is...
6 All Night Long (I) NaN A San448 San448 Full 6703 Paul and Silas, bound in jail, All night long....
... ... ... ... ... ... ... ...
10095 NaN YOUNG REDIN NaN C068 NaN 47 Young Redin's til the hunting gane\nWi' therty...
10097 NaN YOUNG SAILOR CUT DOWN IN HIS PRIME NaN LoF201 NaN 2 One day as I strolled down by the Royal Albion...
10106 NaN ZACK, THE MORMON ENGINEER NaN BRaF444 NaN 4761 Old Zack, he came to Utah, way back in seventy...
10108 NaN ZEB TOURNEY'S GIRL NaN LE18 LE18 Full 2249 Down in the Tennessee mountains,\nFar from the...
10109 NaN ZEBRA DUN NaN LB16 NaN 3237 We was camped on the plains at the head of the...

3913 rows × 7 columns

Inherit 'name'¶

Not all songs have both a 'key_name' and a 'name', so to standardise the data I will inherit missing 'name' data from 'key_name'.
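
As a minimal sketch of this step, assuming a frame with the two columns named above (the three rows are a toy subset, and `fillna` is one plausible way to do it):

```python
import pandas as pd

df = pd.DataFrame({
    "key_name": ["A, U, Hinny Bird", "Adieu to Erin (The Emigrant)", None],
    "name": [None, "Adieu to Erin", "ZEBRA DUN"],
})

# Where 'name' is missing, take the value from 'key_name';
# existing names are left untouched.
df["name"] = df["name"].fillna(df["key_name"])
print(df["name"].tolist())
```

Rows that already have a 'name' keep it, so the DT titles survive unchanged until the case is unified.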

Unify 'name' case¶

Song names from DT are in all capitals, so I also unify the 'name' case using the dedicated titlecase module. Rather than the simple initial capitals produced by the built-in title string method, titlecase attempts a smarter transformation into title case.

Examples: Comparing title and titlecase:¶

The main problem with title was the mishandling of a letter after an apostrophe (0, 2-5), followed by the stylistically dubious capitalisation of all "small words" (1, 4). titlecase also has further benefits for other special string sequences, for example proper names beginning "Mc" (2).

Note: a small tweak to the source code was needed to properly handle "O'..." names (5), and I added a regex to treat Roman numerals (6) as acronyms. It's also possible to specify a wordlist file for other acronyms, but I've not done this (7).

Out[30]:
Example name Using title Method Using titlecase Module
0 AULD MAN'S MARE'S DEID Auld Man'S Mare'S Deid Auld Man's Mare's Deid
1 ALL THROUGH THE ALE All Through The Ale All Through the Ale
2 MCKINLEY'S RAG Mckinley'S Rag McKinley's Rag
3 MACNAMARA'S BAND Macnamara'S Band Macnamara's Band
4 WHA'LL BE KING BUT CHARLIE? Wha'Ll Be King But Charlie? Wha'll Be King but Charlie?
5 O'DOOLEY'S FIRST FIVE O'CLOCK TEA O'Dooley'S First Five O'Clock Tea O'Dooley's First Five O'Clock Tea
6 KEEL ROW III Keel Row Iii Keel Row III
7 ALL AROUND MY HEART (IRA) All Around My Heart (Ira) All Around My Heart (Ira)
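
The "Using title Method" column can be reproduced with the standard library alone. This short sketch covers only the built-in `str.title` side of the comparison, using three names from the table. The titlecase module is left out so the snippet has no dependencies.

```python
examples = ["AULD MAN'S MARE'S DEID", "MCKINLEY'S RAG", "KEEL ROW III"]

# str.title capitalises the first letter of every run of letters, so a
# letter following an apostrophe is treated as the start of a new word.
titled = [s.title() for s in examples]

for before, after in zip(examples, titled):
    print(f"{before!r} -> {after!r}")
```
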
Out[32]:
key_name name version_in_key bi_file dt_file roud lyrics
0 A Robin, Jolly Robin A Robyn Jolly Robyn A Perc1185 HEYROBIN NaN "[F]rom what appears to be the most ancient of...
1 A Robin, Jolly Robin (No Title) B Perc1185 HEYROBIN NaN 71 'Hey, Robin, jolly Robin, 72 Tell me how...
2 A, U, Hinny Bird A, U, Hinny Bird A StoR160 NaN 235 A, U, hinny burd; The bonny lass o' Benwell, A...
3 Adieu to Erin (The Emigrant) Adieu to Erin A SWMS255 NaN 2068 Oh, when I breathed a last adieu, To Erin's an...
4 Agincourt Carol, The The Song of Agincourt A MEL51 AGINCRT1 V29347 Deo gracias anglia, Redde pro victoria, 1 Owre...
... ... ... ... ... ... ... ...
10108 NaN Zeb Tourney's Girl NaN LE18 ZEBTURNY 2249 Down in the Tennessee mountains,\nFar from the...
10109 NaN Zebra Dun NaN LB16 ZEBRADUN 3237 We was camped on the plains at the head of the...
10110 NaN Zen Gospel Singing NaN NaN ZENGOSPE NaN I once was a Baptist and on each Sunday morn\n...
10111 NaN Zuleika NaN NaN ZULIKA NaN Zuleika was fair to see,\nA fair Persian maide...
10112 NaN The Zulu King NaN NaN ZULUKING NaN Oh the Zulu king with the big nose-ring\nFell ...

10113 rows × 7 columns

Infer Roud number from other versions¶

Because the Roud index groups versions of the same song, a song without a Roud number can inherit one from songs with the same dt_file (matched after removing a single trailing digit, which distinguishes variations). However, this carries a risk of false positives (for example, TITANIC3 is not the same song as TITANIC6), so inherited numbers should be stored separately from the original ones. This is not yet done.

dt_file BAYOUSAR was assigned roud 10010 and 4139
dt_file COWDENK2 was assigned roud 8209
dt_file KATEHRN3 was assigned roud 555
dt_file SOLDMAI2 was assigned roud 226
dt_file ABDULBL3 was assigned roud 4321
dt_file AGINCRT2 was assigned roud V29347
dt_file GRTCRAZY was assigned roud 15691
dt_file GOODNEW2 was assigned roud 11891
dt_file STARVDT2 was assigned roud 799
dt_file RONDHAT2 was assigned roud 803 plus 3729, 1034
dt_file RONDHAT3 was assigned roud 803 plus 3729, 1034
dt_file DOWNOUTB was assigned roud 18521
dt_file RONDHAT4 was assigned roud 803 plus 3729, 1034
dt_file HEREGRG2 was assigned roud 475
dt_file SVNVIRG2 was assigned roud 127
dt_file AMAZGRA3 was assigned roud 5430
dt_file UNFORTU3 was assigned roud 4859
dt_file AMPHITR2 was assigned roud 301
dt_file HENRMRT2 was assigned roud 104
dt_file ANGLBAND was assigned roud 4268
dt_file AULDLNG5 was assigned roud 13892
dt_file AULDLNG3 was assigned roud 13892
dt_file AULDLNG4 was assigned roud 13892
dt_file DEAFWOM2 was assigned roud 467
dt_file AVONDALE was assigned roud 3250
dt_file BABWOOD4 was assigned roud 288
dt_file BABWOOD5 was assigned roud 288
dt_file LAREDS11 was assigned roud 2
dt_file LAREDST6 was assigned roud 2
dt_file ALANWATR was assigned roud 4260
dt_file GREWILL2 was assigned roud 172
dt_file SWTDUND2 was assigned roud 148
dt_file BNKSBAN3 was assigned roud 889
dt_file BNKSDEE was assigned roud 3847
dt_file BNKSDEE2 was assigned roud 3847
dt_file BNKSLEE was assigned roud 6857
dt_file BRNBRKL3 was assigned roud 4017
dt_file HGHTALM2 was assigned roud 830
dt_file BAYOUSAR was assigned roud 10010 and 4139
dt_file UNFORTU5 was assigned roud 4859
dt_file ABEGGIN2 was assigned roud 286
dt_file BOGIEBL2 was assigned roud 2155
dt_file BESSBANK was assigned roud 566
dt_file BETSYBEL was assigned roud 5211
dt_file BIGROCK was assigned roud 6696
dt_file GYPLAD6 was assigned roud 1
dt_file ROLLCTT3 was assigned roud 2627
dt_file BLKVEL3 was assigned roud 2146 and 3764
dt_file BLKWTR2 was assigned roud 564
dt_file BLRNSTO2 was assigned roud 4800
dt_file BLINDFI2 was assigned roud 7833
dt_file BLOWYE3 was assigned roud 2012
dt_file BLOWYE2 was assigned roud 2012
dt_file BLUEYES was assigned roud 4308 and 18831
dt_file BLUEVEL3 was assigned roud 2146 and 3764
dt_file BOLDARC3 was assigned roud 83
dt_file VANTYGL8 was assigned roud 122
dt_file NEVSAYNO was assigned roud 2903
dt_file BLACKHR2 was assigned roud 1656
dt_file DAILYGR2 was assigned roud 31
dt_file GLENSHE was assigned roud 292
dt_file BOTBAY3 was assigned roud 3267
dt_file BALQUID2 was assigned roud 541
dt_file LAREDS13 was assigned roud 2
dt_file BONLGHT2 was assigned roud 1185
dt_file BRKLYNST was assigned roud 3258
dt_file COWDENK2 was assigned roud 8209
dt_file BRNGIRL2 was assigned roud 180
dt_file MILLDEE4 was assigned roud 310
dt_file BULLYTW2 was assigned roud 4182
dt_file BROOMBES was assigned roud 1623
dt_file BYHUSH was assigned roud 2314
dt_file BYKERHIL was assigned roud 3488
dt_file EASYRID2 was assigned roud 10056
dt_file CALEWES was assigned roud 857
dt_file CALEWE2 was assigned roud 857
dt_file SOUNDOF3 was assigned roud 10398
dt_file CAMFRAN2 was assigned roud 5814
dt_file CNAANLND was assigned roud 5722
dt_file CHARLTT2 was assigned roud 4839
dt_file CASEJON2 was assigned roud 3247
dt_file CHERTRE3 was assigned roud 453
dt_file CHERTRE2 was assigned roud 453
dt_file CHISHLM2 was assigned roud 3438
dt_file CLAUDALL was assigned roud 2245
dt_file CLEMENT2 was assigned roud 9611
dt_file CLERKSA was assigned roud 3855
dt_file CLYDWAT2 was assigned roud 91
dt_file CLYDWAT3 was assigned roud 91
dt_file CLYDWAT2 was assigned roud 91
dt_file CSTPERU2 was assigned roud 1997
dt_file LILSADI2 was assigned roud 780
dt_file CODLIVR2 was assigned roud 4221
dt_file COLDRAW was assigned roud 135
dt_file LOWHOLL9 was assigned roud 484
dt_file COLUMBIA was assigned roud 4843
dt_file COMWRIT2 was assigned roud 381
dt_file COMTHRY2 was assigned roud 5512
dt_file AMAZGRA4 was assigned roud 5430
dt_file COTTNFLD was assigned roud 11662
dt_file SWTJOAN3 was assigned roud 592
dt_file CROCKRY2 was assigned roud 1490
dt_file CROPPIE1 was assigned roud 1030
dt_file CRUELMO4 was assigned roud 263
dt_file ANDRROS3 was assigned roud 623
dt_file CALABAR2 was assigned roud 1079
dt_file CURRKILD was assigned roud 583
dt_file CUTYWRE2 was assigned roud 236
dt_file JOHNPEL2 was assigned roud 1239
dt_file DAILYGR6 was assigned roud 31
dt_file DNTDAVE was assigned roud 2387
dt_file FYVIOLS3 was assigned roud 545
dt_file SUNSCHOL2 was assigned roud 766
dt_file JNGLBLL3 was assigned roud 25804
dt_file DAWNDAY2 was assigned roud 370
dt_file DELIAGO4 was assigned roud 3264
dt_file DELIAGO5 was assigned roud 3264
dt_file DELIAGO6 was assigned roud 3264
dt_file DERBYRM6 was assigned roud 126
dt_file DEVLWIF6 was assigned roud 160
dt_file DEVLWIDW was assigned roud 160
dt_file DEVLMAR2 was assigned roud 1017
dt_file DIAMONJ2 was assigned roud 3585
dt_file THOLDMN2 was assigned roud 3550
dt_file DOMISS2 was assigned roud 4366
dt_file DONTSEL2 was assigned roud 7796
dt_file HTHRMOR2 was assigned roud 375
dt_file RIVTEX2 was assigned roud 4764
dt_file CLEMENT5 was assigned roud 9611
dt_file CANEBREK was assigned roud 10063
dt_file DRUNKDR2 was assigned roud 722
dt_file DUMYLINE was assigned roud 15359
dt_file DUNCBRDY was assigned roud 4177
dt_file LAREDST2 was assigned roud 2
dt_file DOUGTRD3 was assigned roud 321
dt_file EARLY1A2 was assigned roud 12682
dt_file EATWORMS was assigned roud 12764
dt_file JLSLOVR3 was assigned roud 500
dt_file BASKETEG) was assigned roud 377
dt_file MARBON4 was assigned roud 183
dt_file PLNWLOO6 was assigned roud 1922
dt_file ELFKNGT was assigned roud 21
dt_file AUTUMN was assigned roud 1706
dt_file COMRND3 was assigned roud 7052
dt_file FACTRSG2 was assigned roud 572
dt_file JLSLOVR4 was assigned roud 500
dt_file FOXOUT3 was assigned roud 131
dt_file FALSKNT3 was assigned roud 20
dt_file FALSKNT4 was assigned roud 20
dt_file BOLAMKN4 was assigned roud 6
dt_file BONLOVE2 was assigned roud 201
dt_file FLSESIR2 was assigned roud 21
dt_file PIGINEB4 was assigned roud 7322
dt_file FAREWELL was assigned roud 803 plus 3729, 1034
dt_file FAREWELS was assigned roud 803 plus 3729, 1034
dt_file FARWELSY was assigned roud 384
dt_file TARWATH2 was assigned roud 2562
dt_file FRMRDELL was assigned roud 6306
dt_file SOLDMAI2 was assigned roud 226
dt_file FINNWATR was assigned roud 1009
dt_file FINNWAK2 was assigned roud 1009
dt_file FIREBEL2 was assigned roud 813
dt_file FIRELOVE was assigned roud 1780
dt_file GOODMAN was assigned roud 114
dt_file GOODMAN4 was assigned roud 114
dt_file GOLDWED2 was assigned roud 5491
dt_file FOGGDEW6 was assigned roud 558
dt_file FOGGDEW6 was assigned roud 558
dt_file FTPRINTS was assigned roud 2660
dt_file FORSAKLV was assigned roud 466
dt_file FOURLOOM was assigned roud 937
dt_file FOURSTWL was assigned roud 36099
dt_file FOXOUT2 was assigned roud 131
dt_file FOXOUT4 was assigned roud 131
dt_file FOXOUT5 was assigned roud 131
dt_file FRGCORT4 was assigned roud 16
dt_file FRGCORT5 was assigned roud 16
dt_file GAMBLR was assigned roud 3416
dt_file GAMBLR3 was assigned roud 3416
dt_file DARLCOR2 was assigned roud 5723
dt_file KERIMUR2 was assigned roud 4828
dt_file GENTLAN2 was assigned roud 2656
dt_file GEORDI3 was assigned roud 90
dt_file GEORDI5 was assigned roud 90
dt_file GILMORE was assigned roud 53
dt_file GINNYGON was assigned roud 481
dt_file GIRLLFT3 was assigned roud 4497 and 7680 and 23929
dt_file GIRLLF11 was assigned roud 4497 and 7680 and 23929
dt_file GIRLLFT2 was assigned roud 4497 and 7680 and 23929
dt_file GIRLLFT8 was assigned roud 4497 and 7680 and 23929
dt_file GIRLLF12 was assigned roud 4497 and 7680 and 23929
dt_file LAREDS14 was assigned roud 2
dt_file GLASGPG2 was assigned roud 95
dt_file GODREST2 was assigned roud 394
dt_file HEARTDIX was assigned roud 18324
dt_file DOWNRIVE was assigned roud 7677
dt_file VANTYGL3 was assigned roud 122
dt_file VANTYGL6 was assigned roud 122
dt_file GOLDWEDD was assigned roud 5491
dt_file GOODBOY2 was assigned roud 13612
dt_file IRENGDN2 was assigned roud 11681
dt_file ADAMEV2 was assigned roud V37609
dt_file IRENGDN3 was assigned roud 11681
dt_file GRNWALE.NOT was assigned roud 2817 and 15026
dt_file GRTGRNDD was assigned roud 4543
dt_file GRNBROM3 was assigned roud 379
dt_file GRNFLDA2 was assigned roud 2290
dt_file GRENGRAS was assigned roud 279
dt_file GRENGREN was assigned roud 279
dt_file GRRASH2 was assigned roud 2772
dt_file GRNRUSH4 was assigned roud 133
dt_file GREENLDY was assigned roud 347
dt_file GRNSLVS2 was assigned roud V19581
dt_file VANTYGL2 was assigned roud 122
dt_file OLDSHOE2 was assigned roud 362
dt_file GREYCOC2 was assigned roud 179
dt_file GRNRUSH3 was assigned roud 133
dt_file GYPLADD4 was assigned roud 1
dt_file NCNTRYM3 was assigned roud 1367
dt_file HRDTIME2 was assigned roud 2659
dt_file OVRHILL7 was assigned roud 8460
dt_file FAIRFLR3 was assigned roud 25
dt_file HEXHMLS2 was assigned roud 3182
dt_file GOODLOKN was assigned roud 3340
dt_file GLASGPG3 was assigned roud 95
dt_file HIELND2 was assigned roud 4691
dt_file HITLERB2 was assigned roud 10493
dt_file SUFFMRC4 was assigned roud 246
dt_file BITWITH2 was assigned roud 452
dt_file HOMESTEA was assigned roud 7744
dt_file HOUSCARN was assigned roud 14
dt_file HUSHLIL2 was assigned roud 470
dt_file HUSHLIL2 was assigned roud 470
dt_file IMAROVR2 was assigned roud 3135
dt_file FLSEBRD9 was assigned roud 154
dt_file KNOWHER2 was assigned roud 1645 and 5701
dt_file SHEEPSNG was assigned roud 879
dt_file TAVTOWN2 was assigned roud 60
dt_file LAREDS17 was assigned roud 2
dt_file JUSTFAC2 was assigned roud 3127
dt_file LUMBERJK was assigned roud 591 and 7088
dt_file MOONSHI3 was assigned roud 414
dt_file IMAROVER was assigned roud 3135
dt_file IRSHLAB2 was assigned roud 1137
dt_file LADAMER was assigned roud 18316
dt_file RYEWHIS2 was assigned roud 941
dt_file INPINE2 was assigned roud 3421
dt_file INTOAIR was assigned roud 15440
dt_file BIGROCK4 was assigned roud 6696
dt_file CURRKIL2 was assigned roud 583
dt_file SIRHUGH4 was assigned roud 73
dt_file SIRHUGH5 was assigned roud 73
dt_file TIPRARY was assigned roud 11235
dt_file SYMEOVR2 was assigned roud 9621
dt_file SYMEOVR3 was assigned roud 9621
dt_file FLATRVR2 was assigned roud 642
dt_file FLATRVR3 was assigned roud 642
dt_file GLENKIN2 was assigned roud 145
dt_file JACKROWL was assigned roud 268
dt_file JACOBLAD was assigned roud 2286
dt_file JMCONNL2 was assigned roud 12495
dt_file JAMFOYE3 was assigned roud 1941
dt_file KATEHRN3 was assigned roud 555
dt_file JESSJAM1 was assigned roud 2240
dt_file BLUETAI2 was assigned roud 1274
dt_file JNGLBLAU was assigned roud 25804
dt_file JNGLBLL2 was assigned roud 25804
dt_file JOHNAND1 was assigned roud 16967
dt_file JOHNAND5 was assigned roud 16967
dt_file JBARLEY2 was assigned roud 164
dt_file JOHNAND6 was assigned roud 16967
dt_file JOHNAND7 was assigned roud 16967
dt_file JOHNHIEL was assigned roud 650
dt_file DUNDER2 was assigned roud 4461
dt_file MARBONE6 was assigned roud 183
dt_file JOLLPLO2 was assigned roud 186
dt_file KAFOOZL2 was assigned roud 10135
dt_file PEGGORD2 was assigned roud 2280
dt_file KATYCRU2 was assigned roud 1645 and 5701
dt_file KEELROW3 was assigned roud 3059
dt_file PRDMARG2 was assigned roud 37
dt_file SPRINGH3 was assigned roud 2713
dt_file LADYFRN4 was assigned roud 487
dt_file GRNSLVS3 was assigned roud V19581
dt_file SILKIE3 was assigned roud 197
dt_file LAIDLEY2 was assigned roud 3968
dt_file ELFKNGT2 was assigned roud 21
dt_file JOHNAND2 was assigned roud 16967
dt_file CURRKIL3 was assigned roud 583
dt_file LNDLDYDT was assigned roud V33309
dt_file LASTROSE was assigned roud 13861
dt_file LAVNDER3 was assigned roud 3483
dt_file FRANJON3 was assigned roud 254
dt_file LEAVLIV2 was assigned roud 9435
dt_file AENICHT was assigned roud 135
dt_file PATGAME2 was assigned roud 18464
dt_file HOLEBCK2 was assigned roud 17845
dt_file LIFERAIL was assigned roud 13933
dt_file BRWNEYED was assigned roud 17030
dt_file LYDIAPN2 was assigned roud 8368
dt_file LYDIAPN3 was assigned roud 8368
dt_file LIMERAKE was assigned roud 3018
dt_file LAREDS18 was assigned roud 2
dt_file LTLBLSS2 was assigned roud 7788
dt_file LILMOHE1 was assigned roud 275
dt_file LITMOSE2 was assigned roud 3546
dt_file LITTLEPD was assigned roud 1930
dt_file LTTLSTCH was assigned roud 1937
dt_file KEACHCR3 was assigned roud 120
dt_file ASHGROV4 was assigned roud 24988
dt_file ASHGROV3 was assigned roud 24988
dt_file LOCHLMD3 was assigned roud 9598
dt_file LAREDST3 was assigned roud 2
dt_file TURTDOV3 was assigned roud 49
dt_file LNGTRAIL was assigned roud 23525
dt_file LONGTIME was assigned roud 5732
dt_file LORDBAT5 was assigned roud 40
dt_file LRDBEIC2 was assigned roud 40
dt_file LORDGRG3 was assigned roud 49
dt_file BRWNGRL3 was assigned roud 4
dt_file LVNGNAN2 was assigned roud 563
dt_file VANTYGL4 was assigned roud 122
dt_file VANTYGL7 was assigned roud 122
dt_file LOWHOL10 was assigned roud 484
dt_file BROKEBN2 was assigned roud 24846
dt_file CALABAR3 was assigned roud 1079
dt_file MARRYNO was assigned roud 1403
dt_file MARTINMA was assigned roud 2173
dt_file MARYANN2 was assigned roud 4438
dt_file MARYLAM3 was assigned roud 7622
dt_file MARYLAMB was assigned roud 7622
dt_file MARTINDL was assigned roud 2173
dt_file MARYSOMR was assigned roud 2496
dt_file MARYLAND was assigned roud 7622
dt_file MATTHYL was assigned roud 2880
dt_file MATTHYL2 was assigned roud 2880
dt_file MATTIE was assigned roud 52
dt_file MAYMRNHM was assigned roud 5405
dt_file HARLCH2 was assigned roud 24790
dt_file MERMAID4 was assigned roud 124
dt_file MERMAID2 was assigned roud 124
dt_file MICHAELR was assigned roud 11975
dt_file MILWAUKE was assigned roud 3255
dt_file MRSHDRK2 was assigned roud 9753
dt_file MOLLYMA2 was assigned roud 16932
dt_file FATALSN2 was assigned roud 175
dt_file MOONSHIN was assigned roud 414
dt_file MOTHR was assigned roud 16113
dt_file BTTLOVE2 was assigned roud 5462
dt_file MTDEW was assigned roud 938
dt_file MTMDOW2 was assigned roud 3240
dt_file MYBONNI2 was assigned roud 1422
dt_file LAMECRN was assigned roud 13622
dt_file DEARCOM2 was assigned roud 411 and 459
dt_file GDOLDMN2 was assigned roud 240
dt_file JOHNLAD2 was assigned roud 6131
dt_file BUCKBRN2 was assigned roud 934
dt_file MYSWEETH was assigned roud 4756
dt_file GIRLLFT9 was assigned roud 4497 and 7680 and 23929
dt_file CALTONW2 was assigned roud 883
dt_file MORNDEW4 was assigned roud 11
dt_file GUNDAGRD was assigned roud 10221 AND 9121
dt_file ONOJOHN2 was assigned roud 146
dt_file NOAHARK was assigned roud 318
dt_file SAMHALL3 was assigned roud 369
dt_file NBDYKNEW was assigned roud 5438
dt_file LAREDS10 was assigned roud 2
dt_file REILLY2 was assigned roud 1161
dt_file SHEARNA3 was assigned roud 4845
dt_file OATSBEAN was assigned roud 1380
dt_file LONEPRA2 was assigned roud 631
dt_file OHDEATH was assigned roud 4933
dt_file HEREGRO2 was assigned roud 475
dt_file FENIANGN was assigned roud 4531
dt_file GIRLLF10 was assigned roud 4497 and 7680 and 23929
dt_file CLOKWIN2 was assigned roud 241
dt_file KNGCOLE3 was assigned roud 1164
dt_file LOGCABI2 was assigned roud 7376
dt_file OLDMAID4 was assigned roud 802
dt_file OLDSHOE3 was assigned roud 362
dt_file OLDSMOK3 was assigned roud 414
dt_file SOWMEAS2 was assigned roud 17759
dt_file STEPSTONE was assigned roud 7453
dt_file OVRHILL8 was assigned roud 8460
dt_file OLDSMOK4 was assigned roud 414
dt_file OLDSMOK5 was assigned roud 414
dt_file ONEMRDY2 was assigned roud 704
dt_file LAREDS12 was assigned roud 2
dt_file ONCHRS2 was assigned roud V26738
dt_file LOWHOL11 was assigned roud 484
dt_file OVRHILL2 was assigned roud 8460
dt_file OVRHILL3 was assigned roud 8460
dt_file SKYEBOT2 was assigned roud 3772
dt_file CHARLOVR was assigned roud 729
dt_file OVRHILL6 was assigned roud 8460
dt_file PADDO was assigned roud 4695
dt_file PADRAIL2 was assigned roud 208
dt_file PADRAIL4 was assigned roud 208
dt_file PAPERPI2 was assigned roud 573
dt_file PATSPENS2 was assigned roud 41
dt_file PADRAIL3 was assigned roud 208
dt_file RDDLSNG4 was assigned roud 330 and 36
dt_file ROOTHOG4 was assigned roud 4292
dt_file PIGINEB3 was assigned roud 7322
dt_file LAREDST9 was assigned roud 2
dt_file PLNWLOO5 was assigned roud 1922
dt_file PLNWLOO4 was assigned roud 1922
dt_file PLESDELT was assigned roud 660
dt_file PLESDEL2 was assigned roud 660
dt_file PLOOLAD2 was assigned roud 5138
dt_file PLOUGHB2 was assigned roud 2538
dt_file PLOUGHMN was assigned roud 2538
dt_file PLOUGHM2 was assigned roud 2538
dt_file PLOUGHM3 was assigned roud 2538
dt_file POLLVON3 was assigned roud 166
dt_file ELLNSMT3 was assigned roud 448
dt_file POORLIL2 was assigned roud 10310
dt_file DEADHOR2 was assigned roud 513
dt_file OLDMAID3 was assigned roud 802
dt_file JOLLHANG was assigned roud 1048
dt_file PRETBABY was assigned roud 288
dt_file LILBIRD was assigned roud 5742
dt_file KEACHCR4 was assigned roud 120
dt_file PRETSAR5 was assigned roud 417
dt_file PRETSAR4 was assigned roud 417
dt_file PRETSAR2 was assigned roud 417
dt_file PRETSAR3 was assigned roud 417
dt_file HANGMAN3 was assigned roud 896
dt_file PUDDYWL3 was assigned roud 16
dt_file PUDDYWEL was assigned roud 16
dt_file PUSHBOYS was assigned roud 8088
dt_file RAGECANO was assigned roud 735
dt_file RRBILLKT was assigned roud 4181
dt_file REDRIVA2 was assigned roud 756
dt_file ELFKNGT3 was assigned roud 21
dt_file REDRIVPL was assigned roud 756
dt_file REYNRDFX was assigned roud 2349
dt_file RCHMRCH2 was assigned roud 536
dt_file RDDLSNG2 was assigned roud 330 and 36
dt_file BONBROQ2 was assigned roud 161
dt_file BONBROQ3 was assigned roud 161
dt_file KEACHCR2 was assigned roud 120
dt_file RISESHEP was assigned roud 11968
dt_file RHPEDLRS was assigned roud 333
dt_file ROBHDTH2 was assigned roud 3299
dt_file ROCKBABY was assigned roud 3024
dt_file VIRGIBN4 was assigned roud 27
dt_file ROCKYMNT was assigned roud 277
dt_file RMCORLY3 was assigned roud 5279
dt_file RMCORLY2 was assigned roud 5279
dt_file ROLLCTT2 was assigned roud 2627
dt_file ROLLCHR2 was assigned roud 3632
dt_file ROOTHOG5 was assigned roud 4292
dt_file ROSEBRIR was assigned roud 1796
dt_file ROSEBUDD was assigned roud 812
dt_file RYEWHISx was assigned roud 941
dt_file GOLDRIVR was assigned roud 7405
dt_file PLESDEL3 was assigned roud 660
dt_file SAILTAI2 was assigned roud 917
dt_file SAILBORD was assigned roud 314
dt_file SAMBAS2 was assigned roud 2244
dt_file SAMHALL2 was assigned roud 369
dt_file SNTYANN3 was assigned roud 207
dt_file ELFKNGT4 was assigned roud 21
dt_file STWBLHR5 was assigned roud 456
dt_file SEVENOL2 was assigned roud 10227
dt_file COLDRAI2 was assigned roud 135
dt_file SHULARN5 was assigned roud 911
dt_file UNFORTU4 was assigned roud 4859
dt_file SILVDAG was assigned roud 22620 and 22621
dt_file SNGLGRL4 was assigned roud 436
dt_file TITANIC9 was assigned roud 4173
dt_file SINERMN3 was assigned roud 3408
dt_file STWBLHR4 was assigned roud 456
dt_file SKYEBOT3 was assigned roud 3772
dt_file FALSKNT5 was assigned roud 20
dt_file SOLONGI2 was assigned roud 15161
dt_file GRTWHEE2 was assigned roud 10237
dt_file SOLDMARN was assigned roud 226
dt_file SOLDBOY3 was assigned roud 1917
dt_file COCKADE2 was assigned roud 191
dt_file BRITGRE2 was assigned roud 11231?
dt_file BACKWODS was assigned roud 641
dt_file SONSLIB was assigned roud 596
dt_file SPRINGHI was assigned roud 2713
dt_file LAREDS16 was assigned roud 2
dt_file STWBLHR3 was assigned roud 456
dt_file STILILO2 was assigned roud 654
dt_file STRFORB2 was assigned roud 20764
dt_file LAREDS15 was assigned roud 2
dt_file LAREDS20 was assigned roud 2
dt_file SWEETBYE was assigned roud 3234
dt_file VANTYGL5 was assigned roud 122
dt_file SWTROSIE was assigned roud 9560
dt_file SWTVILT was assigned roud 10232 and 10404
dt_file CARCROW3 was assigned roud 891
dt_file TARYTRO2 was assigned roud 427
dt_file TARYTRO3 was assigned roud 427
dt_file TEDONEIL was assigned roud 5207
dt_file POORLOU was assigned roud 4643
dt_file THISLAN2 was assigned roud 16378
dt_file THREEBL2 was assigned roud 3753
dt_file THREEBRO was assigned roud 3753
dt_file TIMEHARD was assigned roud 16072
dt_file TITANIC7 was assigned roud 4173
dt_file TITANIC9 was assigned roud 4173
dt_file TITANIC6 was assigned roud 4173
dt_file MORROW1 was assigned roud 9554
dt_file CANONBL2 was assigned roud 4759
dt_file TOMMYHL2 was assigned roud 481
dt_file GLORYPE2 was assigned roud 19921
dt_file CLOSEWND was assigned roud 15986
dt_file GYPLAD5 was assigned roud 1
dt_file LAREDST7 was assigned roud 2
dt_file JESSJAM2 was assigned roud 2240
dt_file TURKSTR2 was assigned roud 4247
dt_file TURNYEM2 was assigned roud 23557
dt_file TUTRLDOV was assigned roud 49
dt_file TURTDOV2 was assigned roud 49
dt_file THRERAV7 was assigned roud 747?
dt_file THRERAV7 was assigned roud 747?
dt_file TWOSIS6 was assigned roud 8
dt_file TWOBROCW was assigned roud 38
dt_file TWOSIS12 was assigned roud 8
dt_file TWOSIS13 was assigned roud 8
dt_file TWOSIS7 was assigned roud 8
dt_file UNDRAPRN was assigned roud 899
dt_file LAREDST5 was assigned roud 2
dt_file DIXIELN2 was assigned roud 8231
dt_file GOTOSEA2 was assigned roud 644
dt_file CRUELMO5 was assigned roud 263
dt_file PLNWLOO4 was assigned roud 1922
dt_file LAREDST4 was assigned roud 2
dt_file SWTJOAN2 was assigned roud 592
dt_file COCKADE3 was assigned roud 191
dt_file DOITNOW2 was assigned roud 1401
dt_file LAREDS19 was assigned roud 2
dt_file BARNBINN was assigned roud 4704
dt_file DEVLWIF5 was assigned roud 160
dt_file REYNFOX2 was assigned roud 1868 and 190
dt_file JACBITE2 was assigned roud 5517
dt_file SALGARD3 was assigned roud 3819
dt_file DAILYGR4 was assigned roud 31
dt_file DAILYGR5 was assigned roud 31
dt_file LAREDST8 was assigned roud 2
dt_file WRCK1262 was assigned roud 7128
dt_file WILLIWI4 was assigned roud 64
dt_file YARROW4 was assigned roud 13
dt_file XMASGOO2 was assigned roud 167
dt_file WASSCORN was assigned roud 209
dt_file YARROW5 was assigned roud 13
dt_file WHENOVR2 was assigned roud 3446
dt_file WALLABBY was assigned roud 7483
dt_file WILDMTH2 was assigned roud 541
dt_file WILDBIKE was assigned roud 2246
dt_file WARGRMN3 was assigned roud 904
dt_file WARLIKES was assigned roud 690
dt_file WE3KING2 was assigned roud 24751
dt_file WEARGRE2 was assigned roud 3278
dt_file WHAKING was assigned roud 729
dt_file WHNCORT was assigned roud 4275 and 2977
dt_file WHCHSID2 was assigned roud 15159
dt_file WHITFIS2 was assigned roud 3888
dt_file WIDWSTMO was assigned roud 228
dt_file WILDBIL2 was assigned roud 2246
dt_file WILILAD2 was assigned roud 220
dt_file WINNIP2 was assigned roud 8348
dt_file WINTERI was assigned roud 1942
dt_file WINTER was assigned roud 1942
dt_file WRECK972 was assigned roud 777
dt_file YLLOWTX3 was assigned roud 10405
dt_file YLLOWTX2 was assigned roud 10405
dt_file YNGPEG2 was assigned roud 3875
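
The inheritance step logged above can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: the `songs` list is invented toy data, `base_key` is a hypothetical helper, and inheriting only when exactly one candidate exists is my own cautious choice.

```python
import re
from collections import defaultdict

# Hypothetical (dt_file, roud) pairs; None marks a missing Roud number.
songs = [
    ("FOXOUT", "131"),
    ("FOXOUT2", None),
    ("FOXOUT3", None),
    ("ZULUKING", None),
]

def base_key(dt_file):
    """Drop a single trailing digit so numbered variants share a key."""
    return re.sub(r"\d$", "", dt_file)

# Collect the known Roud numbers for each base key.
known = defaultdict(set)
for dt_file, roud in songs:
    if roud is not None:
        known[base_key(dt_file)].add(roud)

# Store inherited values in a separate mapping rather than overwriting
# the originals, and only inherit when the candidate is unambiguous.
inherited = {}
for dt_file, roud in songs:
    candidates = known.get(base_key(dt_file), set())
    if roud is None and len(candidates) == 1:
        inherited[dt_file] = next(iter(candidates))

print(inherited)
```

Note that `base_key("TITANIC3")` and `base_key("TITANIC6")` both give `"TITANIC"`, which is exactly the false-positive risk mentioned above.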

Final merged dataset:¶

Finally, I will drop exact duplicates, clean the lyrics of Unicode control characters, and reindex the data. For now, lyrics without Roud numbers will stay in, as they can also eventually be clustered. This is the full dataset that I will pickle and use to evaluate the results of the clustering. The current index numbers are retained in the 'index' column as a reference, so that the data can be joined back up after clustering.

Out[38]:
index key_name name version_in_key bi_file dt_file roud lyrics
0 0 A Robin, Jolly Robin A Robyn Jolly Robyn A Perc1185 HEYROBIN NaN "[F]rom what appears to be the most ancient of...
1 1 A Robin, Jolly Robin (No Title) B Perc1185 HEYROBIN NaN 71 'Hey, Robin, jolly Robin, 72 Tell me how...
2 2 A, U, Hinny Bird A, U, Hinny Bird A StoR160 NaN 235 A, U, hinny burd; The bonny lass o' Benwell, A...
3 3 Adieu to Erin (The Emigrant) Adieu to Erin A SWMS255 NaN 2068 Oh, when I breathed a last adieu, To Erin's an...
4 4 Agincourt Carol, The The Song of Agincourt A MEL51 AGINCRT1 V29347 Deo gracias anglia, Redde pro victoria, 1 Owre...
... ... ... ... ... ... ... ... ...
9968 10108 NaN Zeb Tourney's Girl NaN LE18 ZEBTURNY 2249 Down in the Tennessee mountains,\nFar from the...
9969 10109 NaN Zebra Dun NaN LB16 ZEBRADUN 3237 We was camped on the plains at the head of the...
9970 10110 NaN Zen Gospel Singing NaN NaN ZENGOSPE NaN I once was a Baptist and on each Sunday morn\n...
9971 10111 NaN Zuleika NaN NaN ZULIKA NaN Zuleika was fair to see,\nA fair Persian maide...
9972 10112 NaN The Zulu King NaN NaN ZULUKING NaN Oh the Zulu king with the big nose-ring\nFell ...

9973 rows × 8 columns

The finished lyrics dataset has:

  • 9973 songs (with lyrics)
  • 4350 song lyrics with Roud numbers
  • 2893 unique Roud numbers*
  • 1259 songs whose Roud number is attached to three or more songs*

* not accounting for multiple Roud numbers

Out[39]:
9973
Out[40]:
4350
Out[41]:
2893
Out[42]:
1259

Some 'roud' fields contain multiple numbers which still need to be split. I will decide later how to do this when labelling.
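One possible way to do the split later, sketched under the assumption that multiple numbers are always joined by the word "and" (as in "4275 and 2977") and that the field is stored as text. The function name split_roud and the roud_list column are hypothetical.

```python
import pandas as pd


def split_roud(value):
    """Turn a 'roud' field into a list of Roud identifiers (as strings)."""
    if pd.isna(value):
        return []
    return [part.strip() for part in str(value).split(" and ") if part.strip()]


songs = pd.DataFrame(
    {
        "dt_file": ["WHNCORT", "YARROW5", "ZULUKING"],
        "roud": ["4275 and 2977", "13", None],
    }
)
songs["roud_list"] = songs["roud"].map(split_roud)

# One row per (song, Roud number) pair; songs without a number keep a NaN row.
long_format = songs.explode("roud_list")
```

Whether a song with two numbers should count towards both groups, or be labelled with only one of them, remains a separate labelling decision.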

Data preprocessing and embedding¶

For the purposes of clustering I will prepare a sample dataset containing only lyrics whose Roud numbers appear three or more times in the data. I also remove some columns for clarity:
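A sketch of how such a sample could be selected with pandas. The function name select_sample is hypothetical, and the raw 'roud' field is counted as-is, so multi-number fields are treated as their own value.

```python
import pandas as pd


def select_sample(df, min_count=3):
    """Keep rows whose 'roud' value occurs at least min_count times."""
    out = df.copy()
    # value_counts() ignores missing values, so rows without a Roud
    # number get NaN here and are filtered out by the comparison.
    out["roud_count"] = out["roud"].map(out["roud"].value_counts())
    out = out[out["roud_count"] >= min_count]
    return out[["name", "bi_file", "dt_file", "roud", "lyrics"]]
```

The original index is preserved, which is why the row labels in the output are not consecutive.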

Out[44]:
name bi_file dt_file roud lyrics
4 The Song of Agincourt MEL51 AGINCRT1 V29347 Deo gracias anglia, Redde pro victoria, 1 Owre...
10 O Falmouth Is a Fine Town LK43A AMBLTOWN 269 Text supplied by Don Duncan. Reportedly writte...
28 Atisket, Atasket (I Sent a Letter to My Love) BAF806A NaN 13188 I wrote a letter to my love; I carried water i...
29 Atisket, Atasket (I Sent a Letter to My Love) BAF806A NaN 13188 And the night before; if he does again to-nigh...
30 Atisket, Atasket (I Sent a Letter to My Love) BAF806A NaN 13188 A green leather basket; I wrote a letter to my...
... ... ... ... ... ...
9942 Young Barbour C100 WILLIWI2 64 'Twas of a lady in the west counteree,\nShe wa...
9948 Young Hunting C068 YNGHUNT 47 It happened on one evening late,\nAs the maid ...
9949 Young Hunting 2 C068 YNGHUNT2 47 A lady stood in her bower door,\nIn her bower ...
9955 Young Redin C068 YNGHUNT5 47 Young Redin's til the hunting gane\nWi' therty...
9957 Young Sailor Cut Down in His Prime LoF201 YNGMNPRM 2 One day as I strolled down by the Royal Albion...

1259 rows × 5 columns

Out[45]:
name bi_file dt_file roud lyrics

Embedding and vectorisation¶

Now that I have a dataset of lyrics, these must be transformed into machine-readable data so that they can be clustered. There are various ways of doing this transformation, and the process is referred to as text vectorisation or embedding.

Although often used interchangeably, simple vectorisation by index and vector space embedding have differences that are relevant to the task at hand. Vectorisation focuses on the discrete indexing and counting of tokens, whereas embeddings represent tokens in a continuous vector space which captures their interrelationships.

  • Vectorisation gives each token an index and a vector indicating its frequency. Frequency can be measured in different ways by different models (e.g. as a simple count, or relative to the token's frequency in the corpus). Examples: n-gram, Bag-of-Words models, TF-IDF, or Count Vectorization
  • Embeddings, on the other hand, use context information to place words in a multidimensional vector space representing the entire input data corpus. Proximity in this space indicates semantic and contextual closeness. Examples: Word2Vec, GloVe, transformer-based models (e.g. BERT, GPT, T5) or RNN-based models (e.g. LSTM, GRU, Hierarchical Attention Networks)
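To make the first bullet concrete, here is a toy count vectorisation written with the standard library only. The three short strings are illustrative stand-ins, not rows from the dataset. Embeddings are not shown because they require a trained model.

```python
import math
from collections import Counter

docs = ["the water is wide", "the water is deep", "speed bonnie boat"]
tokens = [doc.split() for doc in docs]

# Every distinct token gets a position (index) in the vector.
vocab = sorted({tok for doc in tokens for tok in doc})

# Each document becomes a vector of token counts over that vocabulary.
count_vectors = [[Counter(doc)[term] for term in vocab] for doc in tokens]


def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


sim_shared_words = cosine(count_vectors[0], count_vectors[1])  # 3 of 4 tokens shared
sim_no_shared_words = cosine(count_vectors[0], count_vectors[2])  # no tokens shared
```

With this representation two texts are only similar if they share literal tokens, and word order is ignored.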

I will use true embeddings for my data, given that the structure of a song is important to its similarity, not only word frequency or semantic similarity.

Embeddings¶

There are many embedding models to choose from, with new ones being released regularly. To select a model I considered the input data (lyrics) and the task (clustering) along with model performance benchmarks. The Massive Text Embedding Benchmark (MTEB) leaderboard on Hugging Face evaluates embedding models across a range of datasets and tasks, including clustering. Overall, language models are not yet as effective at clustering as they are at other tasks such as classification.

The best embeddings tested for clustering are currently:

  • gte (large and base) from Alibaba DAMO Academy - a BERT-based model
  • bge (large and base) from BAAI - a BERT-based model
  • text-embedding-ada-002 from OpenAI
  • instructor (large) from HKU NLP - a T5-based model

The authors of the original MTEB paper suggest that the front-runner at the time of publication, MPNet, may have performed better because of the diversity of its training datasets and its resulting ability to create well-spaced embeddings for types of text it had not previously encountered. If true, I imagine this factor is particularly relevant for lyrics, because their structure differs from that of ordinary prose.

Embedding with instructor-large and SentenceTransformer¶

Instead of embedding only the subset of songs sharing Roud numbers, I embed the whole corpus of lyrics in df_lyrics2. The lyrics are fed in one at a time, so the corpus is not actually considered as a whole (as it would be with TF-IDF), but the high performance of large language embedding models in 2023 means this may not be necessary.

First I calculate the embeddings, which takes several hours.

Then I append the embeddings to the main dataset.
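A sketch of these two steps, written so that any encoder object with an encode method can be passed in. The helper name add_embeddings is hypothetical. The commented usage at the bottom assumes the sentence-transformers package can load the hkunlp/instructor-large checkpoint, which I have not verified for every version of that package.

```python
import numpy as np
import pandas as pd


def add_embeddings(df, encoder, text_col="lyrics",
                   out_col="lyric_embed_instructor", batch_size=32):
    """Encode each text independently and store one vector per row."""
    vectors = encoder.encode(df[text_col].tolist(), batch_size=batch_size)
    out = df.copy()
    # Store each embedding as a single array-valued cell, aligned on the index.
    out[out_col] = pd.Series(list(np.asarray(vectors)), index=out.index)
    return out


# Assumed usage (slow: the run described here took several hours):
# from sentence_transformers import SentenceTransformer
# encoder = SentenceTransformer("hkunlp/instructor-large")
# df_lyrics2 = add_embeddings(df_lyrics2, encoder)
```

Because each text is encoded on its own, the result for one song does not depend on which other songs are in the corpus.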

Each embedding has 768 dimensions, meaning that it cannot be visualised graphically without some transformation, e.g. dimensionality reduction.

Out[49]:
768

I will, however, use all the dimensions for the clustering itself.

Clustering¶

Model selection¶

I want to cluster my lyrics into non-discrete (overlapping) clusters based on the high-dimensional embeddings data. This led me to consider the following approaches and potential models:

Soft Clustering: assigns data points to multiple clusters, with varying degrees of membership

  • Centroid-based: Fuzzy C-Means: like K-Means, but with flexible cluster membership
    • Disadvantages:
      • Requires specifying the number of clusters
      • Assumes clusters are uniform blobs
      • Sensitive to initial cluster placement
  • Gaussian Mixture Models (GMM): assigns probabilities to data points' cluster memberships using Expectation-Maximisation
    • Disadvantages:
      • Assumes data points in a cluster will have a Gaussian distribution, and my clusters might have too few members to measure this. It is also unclear to me if this would be true for lyrics data
      • Sensitive to initial cluster placement (is initialised similarly to, and often using, K-Means)

Hierarchical Clustering: puts data points into a hierarchical tree of clusters

  • Hierarchical Agglomerative Clustering (HAC) starts from initial points and agglomerates other points to them (compare: divisive clustering)
    • Advantages:
      • Could provide insights into how lyrics versions are connected and descended/derived from others
    • Disadvantages:
      • Computationally intensive
      • Produces discrete hierarchies, whereas lyrics may be more like a web

Density-based Clustering: creates clusters by detecting areas of increased density in the feature space

  • HDBSCAN: based on DBSCAN, but creates a hierarchical tree of possible clusters arranged by density, starting out with only the most densely arranged points clustered, and splitting the tree where points have ambiguous membership. Selecting a cutoff level in the tree allows for a mix of cluster densities in the final configuration.
    • Advantages:
      • Soft cluster assignment is possible https://hdbscan.readthedocs.io/en/latest/soft_clustering.html
      • Automatically determines the number of clusters
      • Handles outliers and unevenly distributed data
      • Can return a medoid for each cluster which, being a real data point by definition, could serve as an exemplar for that cluster

Affinity Propagation Clustering: clusters data points by passing messages between pairs of points about group membership preferences

  • AffinityPropagation
    • Advantages:
      • Captures non-linear relationships between data, potentially good for finding less-obvious connections between texts
      • Automatically determines the number of clusters
      • Returns an exemplar of each cluster
    • Disadvantages:
      • Requires careful parameter tuning for optimal results.
      • Hard clustering only (a fuzzy version [SCAP] exists but not as a Python library)

These are summarised in the table below. HDBSCAN was the model that fulfilled the most requirements.

| Model | method | predicts number of clusters | soft/fuzzy cluster membership | multi-dimensional data | nonspherical or uneven density clusters |
|---|---|---|---|---|---|
| Fuzzy C-Means | Centroid | ❌ | ✔️ | ❌ | ❌ |
| GMM | Probability; Expectation Maximisation | ❌ | ✔️ | ❌ | ✔️ |
| HDBSCAN | Density | ✔️ | ✔️ (experimental) | ✔️ | ✔️ |
| Affinity propagation | "message passing" | ✔️ | ❌ | ✔️? | ✔️ |

Method 1: Clustering with the HDBSCAN model from sklearn¶

HDBSCAN is a new addition to sklearn in version 1.3, released June 2023, so I have to update before I can import the modules needed.

I will run the clustering initially on the sample dataset (the songs with Roud numbers having at least 3 songs attached) and also set min_cluster_size to 3. This means a total of 1259 embeddings:

Out[51]:
lyrics lyric_embed_instructor
4 Deo gracias anglia, Redde pro victoria, 1 Owre... [-2.9962948e-05, -0.009317334, -0.017600924, 0...
10 Text supplied by Don Duncan. Reportedly writte... [-0.034601513, -0.030833337, -0.03212969, 0.04...
28 I wrote a letter to my love; I carried water i... [-0.02106121, 5.913962e-05, -0.011913068, 0.01...
29 And the night before; if he does again to-nigh... [-0.016591273, 0.004053459, -0.042086408, 0.03...
30 A green leather basket; I wrote a letter to my... [-0.021746654, 0.013723225, -0.008460247, 0.02...
... ... ...
9942 'Twas of a lady in the west counteree,\nShe wa... [-0.02995709, -0.00053723896, -0.014363075, 0....
9948 It happened on one evening late,\nAs the maid ... [-0.041073702, -0.010606567, -0.022429088, 0.0...
9949 A lady stood in her bower door,\nIn her bower ... [-0.05010883, 0.017027983, -0.017928332, 0.034...
9955 Young Redin's til the hunting gane\nWi' therty... [-0.024981715, -0.017257249, -0.045718156, 0.0...
9957 One day as I strolled down by the Royal Albion... [-0.03969955, 0.0018224359, -0.02169303, 0.015...

1259 rows × 2 columns

Fit the model¶

Out[52]:
HDBSCAN(min_cluster_size=3, store_centers='medoid')

Now each data point has been assigned a cluster label and probabilities:

Out[53]:
0 probabilities
0 21 1.0
1 -1 0.0
2 5 1.0
3 5 1.0
4 5 1.0
... ... ...
1254 128 1.0
1255 -1 0.0
1256 -1 0.0
1257 -1 0.0
1258 76 1.0

1259 rows × 2 columns

Examine clusters¶

Unfortunately, over half of the data points were simply assigned -1, meaning the data was considered 'noisy' by HDBSCAN. (None were assigned -2 or -3 for invalid data.)

Out[54]:
0 probabilities
1 -1 0.0
7 -1 0.0
8 -1 0.0
9 -1 0.0
11 -1 0.0
... ... ...
1247 -1 0.0
1248 -1 0.0
1255 -1 0.0
1256 -1 0.0
1257 -1 0.0

749 rows × 2 columns

The probabilities of cluster membership among items assigned to a cluster are mainly 1, but some are lower:

Out[55]:
count    117.000000
mean       0.908851
Name: probabilities, dtype: float64

Dimension reduction for cluster visualisation¶

If I try to plot the result, there are too many dimensions to see anything useful:

Figure

Although no dimension reduction was initially done for calculating the clusters, we can still use it to visualise the data points in 2 dimensions. There are various dimensionality reduction techniques. Here are two popular ones:

  • T-SNE is better at preserving clusters.
  • PCA is better at preserving distances and therefore the size of differences.

Because both techniques necessarily lose information, it's important to check that the chosen one helps in this particular context. Let's compare them on the data:
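A sketch of computing both reductions with scikit-learn, using random vectors as a stand-in for the embeddings. The perplexity value and random_state are illustrative assumptions, not the settings used for the figures.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 768))  # stand-in for the lyric embeddings

# t-SNE: non-linear, tries to keep neighbours together; perplexity must be
# smaller than the number of samples.
tsne_2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)

# PCA: linear projection onto the two directions of greatest variance.
pca_2d = PCA(n_components=2, random_state=0).fit_transform(embeddings)

# Store one 2D coordinate pair per row, next to the other columns.
sample = pd.DataFrame({"row": range(len(embeddings))})
sample["tsne"] = list(tsne_2d)
sample["pca"] = list(pca_2d)
```

Note that t-SNE coordinates depend on the random seed and perplexity, so distances between far-apart groups in a t-SNE plot should not be over-interpreted.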

Some visualisation techniques automatically perform this feature/dimension reduction, but for now I will calculate them separately and append them to the sample dataset. Now a sample record looks like this:

index                                                                   114
key_name                                      Blue-Tail Fly, The [Laws I19]
name                                                         Jim Crack Corn
version_in_key                                                            A
bi_file                                                                LI19
dt_file                                                            BLUETAIL
roud                                                                   1274
lyrics                    When I was young I us'd to wait, On Massa and ...
roud_count                                                              3.0
lyric_embed_instructor    [-0.018739486, -0.010365271, -0.058310274, 0.0...
tsne                               [4.3185577392578125, -49.74964141845703]
pca                              [0.0858643501996994, -0.07742226868867874]
Name: 110, dtype: object
T-SNE:¶
Figure
PCA:¶
Figure

T-SNE seems to produce the best results, but I need to inspect the data to be sure that the clusters reflect the lyrics.

Compare clusters to Roud numbers¶

Here are the counts of unique song names and Roud numbers per cluster label. Some clusters have multiple Roud numbers in them:
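A sketch of how these counts can be produced with a pandas groupby. The toy rows and the column name cluster_label are illustrative assumptions.

```python
import pandas as pd

clustered = pd.DataFrame(
    {
        "name": ["song A", "song B", "song C", "song D", "song E", "song F", "song G"],
        "roud": ["478", "478", "478", "124", "124", "4173", "41"],
        "cluster_label": [0, 0, 0, 1, 1, 1, -1],
    }
)

# Number of distinct song names and distinct Roud numbers in each cluster.
per_cluster = (
    clustered.groupby("cluster_label")[["name", "roud"]]
    .nunique()
    .sort_values("roud")
)

# Real clusters (not the noise label -1) that mix more than one Roud number.
mixed = per_cluster[(per_cluster.index != -1) & (per_cluster["roud"] > 1)]
```

A cluster with a Roud count of 1 agrees with the index. A higher count means the model grouped songs that the index keeps apart.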

Out[62]:
name roud probabilities
0
64 1 1 3
78 3 1 1
77 6 1 4
76 5 1 2
120 3 1 1
... ... ... ...
116 6 4 3
107 10 5 6
128 7 5 4
112 7 5 4
-1 738 241 1

131 rows × 3 columns

Let's graph only the songs that got a valid label (not -1):

Figure

Most cluster labels, however, were assigned to exactly three songs (our minimum set for both cluster size and minimum Roud sample size) all of the same Roud number. That suggests at least a partially successful clustering.

Let's look at one of the 24 clusters containing songs with differing Roud numbers and see if it's clear why this happened:

Out[65]:
name roud probabilities
cluster_label
128 7 5 4
112 7 5 4
107 10 5 6
116 6 4 3
28 6 3 4
48 6 3 4
75 4 3 2
102 4 2 2
126 4 2 2
125 8 2 6
119 3 2 1
111 4 2 2
110 4 2 2
101 4 2 2
39 2 2 3
94 3 2 1
84 3 2 1
82 4 2 2
81 4 2 3
74 5 2 3
63 3 2 1
49 2 2 1
46 5 2 2
129 5 2 2
Figure

Cluster 107 is particularly diffuse. Upon an initial inspection of these songs, they are all immediately recognisable as nautical. This is a good demonstration of the semantic embedding.

Out[67]:
key_name name roud lyrics 0 probabilities
30 Captain Glen/The New York Trader (The Guilty S... Captain Glen's Unhappy Voyage to New Barbary 478 There was a ship, and a ship of fame, Launched... 107 1.000000
856 NaN The New York Trader 478 To a New York trader I dld belong.\nShe was we... 107 1.000000
1241 NaN William Glen 478 There was a ship and a ship of fame .\nLaunch'... 107 1.000000
1119 NaN The Titanic 6 4173 You feeling hearted Christians, oh, listen to ... 107 0.945796
31 Captain Ward and the Rainbow [Child 287] Captain Ward and the Rainbow [Child 287] 224 Strike up, ye lust gallants, With music beat o... 107 0.928249
372 NaN The Calabar 1079 Come all ye dry-land sailors and listen to my ... 107 0.921126
833 NaN The Mermaid 124 Twas Friday morn when we set sail\nAnd we were... 107 0.916683
835 NaN The Mermaid (4) 124 As I sailed from Galway in service to the Quee... 107 0.916683
1031 NaN Sinking of the Titanic 4173 It was on the 10th of April on a sunny afterno... 107 0.914898
1117 NaN Titanic (7) 4173 It was midnight on the sea\nBand playing, "nea... 107 0.914898

All the lyrics with 100% probability or close are variations on Captain Glen/The New York Trader, Roud #478.

Interestingly, the probabilities mainly align with the Roud numbers of the other songs, except for The Titanic (6) which is separated from other versions. According to the Clare County Library, this version in fact has Roud #6662 and not #4173. The source, DT, has no Roud number, which indicates the number was misapplied by one of my algorithms, probably the filename allocation.

It could be possible that 'leaf' cluster selection might give better results given the many small clusters. Visualising the clustering tree or looking at the probabilities could give more insights, but the sklearn version of HDBSCAN doesn't support this feature, so I will continue with the original version of the model.

Method 2: Clustering using original hdbscan.HDBSCAN¶

The version of HDBSCAN currently implemented in sklearn also lacks the prediction_data=True parameter, which is what allows us to do soft clustering. Instead I'll load the standalone scikit-learn-contrib version from which it is derived (installed from GitHub, because there is a bug in the current main one). This time I will fit the model with the extra parameters available.

Out[68]:
HDBSCAN(gen_min_span_tree=True, min_cluster_size=3, min_samples=2,
        prediction_data=True)

Using min_samples=2 I could achieve a similar number of clusters to the sklearn version, at 129. Using any higher value resulted in three or fewer clusters.

Out[69]:
129
Out[70]:
array([1., 0., 1., ..., 0., 0., 1.])

117 clustered samples have probabilities of less than 1.

Out[81]:
cluster_label probability
15 39 0.834146
31 107 0.928249
50 66 0.737086
54 125 0.924799
56 64 0.906728
... ... ...
1190 74 0.950982
1213 128 0.988382
1249 28 0.914149
1250 28 0.936858
1253 116 0.879220

117 rows × 2 columns

Cluster Analysis¶

Dendrogram and distance metrics¶

Here is the tree of splits from the new clusterer. It has a potentially unusual left-branching shape, which may indicate that there could be a more efficient way to split the data, but it's not yet clear how.

Figure
Distance metrics¶

Distance metrics affect the 'closeness' of points, and the choice of metric might therefore depend on what kind of closeness should be measured in a given task. Let's see if different distance metrics make a difference to how splits are discovered by the clustering model (min_cluster_size=3, min_samples=2).
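A small numeric illustration of why the choice matters: the same query point can have a different nearest neighbour depending on the metric. The points are made up for the example.

```python
import numpy as np

metrics = {
    "euclidean": lambda a, b: float(np.sqrt(((a - b) ** 2).sum())),
    "manhattan": lambda a, b: float(np.abs(a - b).sum()),
    "chebyshev": lambda a, b: float(np.abs(a - b).max()),
}

query = np.array([0.0, 0.0])
candidates = {
    "p1": np.array([3.0, 0.0]),  # far along one axis only
    "p2": np.array([2.0, 2.0]),  # moderately far along both axes
}

# For each metric, find which candidate is closest to the query point.
nearest = {
    metric_name: min(candidates, key=lambda key: dist(query, candidates[key]))
    for metric_name, dist in metrics.items()
}
```

Here p2 is nearest under Euclidean and Chebyshev distance, while p1 is nearest under Manhattan distance. A density-based model builds its clusters from such neighbour relationships, so the metric can change where splits occur.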

Figure

Soft clustering membership probabilities (multiple clusters per point)¶

According to the docs, the probability that point i is a member of cluster j is in membership_vectors[i, j]. I extracted the first, second and third choices of clusters for each data point and combined them with the hard clustering results.
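A sketch of the extraction step using NumPy only. The function name top_soft_clusters is hypothetical. In the hdbscan library, the membership matrix would come from hdbscan.all_points_membership_vectors(clusterer) on a model fitted with prediction_data=True. A small hand-written matrix stands in for it here.

```python
import numpy as np
import pandas as pd


def top_soft_clusters(membership_vectors, k=3):
    """Return the k most probable clusters per point, best first."""
    membership_vectors = np.asarray(membership_vectors)
    # Sort cluster indices by descending membership and keep the first k.
    order = np.argsort(-membership_vectors, axis=1)[:, :k]
    probs = np.take_along_axis(membership_vectors, order, axis=1)
    columns = {}
    for rank in range(k):
        columns[f"soft_cluster_{rank}"] = order[:, rank]
        columns[f"soft_probability_{rank}"] = probs[:, rank]
    return pd.DataFrame(columns)


# Stand-in matrix: 2 points, 4 clusters.
membership = np.array(
    [
        [0.10, 0.70, 0.20, 0.00],
        [0.50, 0.20, 0.25, 0.05],
    ]
)
soft = top_soft_clusters(membership)
```

The resulting frame has the same row order as the input, so it can be concatenated column-wise with the hard clustering labels.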

Out[86]:
name roud roud_count lyrics cluster_label probability soft_cluster_0 soft_probability_0 soft_cluster_1 soft_probability_1 soft_cluster_2 soft_probability_2
0 The Song of Agincourt V29347 3.0 Deo gracias anglia, Redde pro victoria, 1 Owre... 21 1.0 21 1.000000 34 1.330760e-306 117 1.321288e-306
1 O Falmouth Is a Fine Town 269 7.0 Text supplied by Don Duncan. Reportedly writte... -1 0.0 119 0.008932 101 8.849026e-03 127 8.712979e-03
2 Atisket, Atasket (I Sent a Letter to My Love) 13188 4.0 I wrote a letter to my love; I carried water i... 5 1.0 5 1.000000 64 1.295204e-306 103 1.283226e-306
3 Atisket, Atasket (I Sent a Letter to My Love) 13188 4.0 And the night before; if he does again to-nigh... 5 1.0 5 1.000000 83 1.389333e-306 88 1.354348e-306
4 Atisket, Atasket (I Sent a Letter to My Love) 13188 4.0 A green leather basket; I wrote a letter to my... 5 1.0 5 1.000000 64 1.418865e-306 86 1.411175e-306
... ... ... ... ... ... ... ... ... ... ... ... ...
1254 Young Barbour 64 4.0 'Twas of a lady in the west counteree,\nShe wa... 128 1.0 128 1.000000 126 1.860417e-306 124 1.826845e-306
1255 Young Hunting 47 5.0 It happened on one evening late,\nAs the maid ... -1 0.0 128 0.009523 124 9.239774e-03 126 8.754074e-03
1256 Young Hunting 2 47 5.0 A lady stood in her bower door,\nIn her bower ... -1 0.0 124 0.009534 126 9.189294e-03 129 9.078690e-03
1257 Young Redin 47 5.0 Young Redin's til the hunting gane\nWi' therty... -1 0.0 117 0.008927 128 8.532307e-03 95 8.371151e-03
1258 Young Sailor Cut Down in His Prime 2 23.0 One day as I strolled down by the Royal Albion... 76 1.0 76 1.000000 122 1.641345e-306 127 1.570397e-306

1259 rows × 12 columns

Let's return to cluster 107. I now also include points that would have been considered 107 if they had been clustered, but were considered too noisy. Note that the cluster membership probabilities work differently in soft format.

Out[88]:
name roud roud_count lyrics cluster_label probability soft_cluster_0 soft_probability_0 soft_cluster_1 soft_probability_1 soft_cluster_2 soft_probability_2
30 Captain Glen's Unhappy Voyage to New Barbary 478 3.0 There was a ship, and a ship of fame, Launched... 107 1.000000 107 1.000000 106 1.711467e-306 100 1.700073e-306
856 The New York Trader 478 3.0 To a New York trader I dld belong.\nShe was we... 107 1.000000 107 1.000000 125 1.708829e-306 127 1.653509e-306
1241 William Glen 478 3.0 There was a ship and a ship of fame .\nLaunch'... 107 1.000000 107 1.000000 106 1.716014e-306 127 1.702851e-306
1119 The Titanic 6 4173 5.0 You feeling hearted Christians, oh, listen to ... 107 0.945796 107 0.010022 127 9.545382e-03 125 9.063358e-03
31 Captain Ward and the Rainbow [Child 287] 224 3.0 Strike up, ye lust gallants, With music beat o... 107 0.928249 107 0.009776 127 9.413565e-03 106 9.251247e-03
372 The Calabar 1079 3.0 Come all ye dry-land sailors and listen to my ... 107 0.921126 107 0.009737 125 9.090746e-03 72 8.763059e-03
206 Andrew Ross (Andrew Rose) 623 3.0 Come all you seamen and give attention\nAnd li... -1 0.000000 107 0.009425 100 8.841495e-03 106 8.790492e-03
252 The Bayou Sara (2) 10010 and 4139 3.0 Sol Matting he lied a-sleeping,\nPoor boy was ... -1 0.000000 107 0.008255 125 8.126335e-03 47 8.104438e-03
305 Blow Ye Winds in the Morning (Ii) 2012 3.0 It's advertised in Boston, New York and Buffal... -1 0.000000 107 0.007852 101 7.799780e-03 100 7.403295e-03
383 Christofo Columbo 4843 3.0 I'll sing to you about a man whose name you'll... -1 0.000000 107 0.008912 101 8.865074e-03 125 8.834819e-03
418 The Cruel Ship's Captain 623 3.0 A boy to me was bound apprenticed\nBecause his... -1 0.000000 107 0.008970 125 8.824942e-03 79 8.372623e-03
628 The House Down in Carne (Nuke Power) 14 4.0 Well me name is Nuke Power, a terror am I,\nI ... -1 0.000000 107 0.008166 67 7.903810e-03 84 7.855460e-03
820 The Manchester Canal 1079 3.0 O the S.S. Irwell left this port the stormy se... -1 0.000000 107 0.008826 125 8.253970e-03 100 8.187633e-03
205 Andrew Rose 623 3.0 Andrew Rose, the British sailor\nNow to you hi... -1 0.000000 107 0.007011 100 6.996008e-03 106 6.655543e-03
1037 Sir Patrick Spens 41 3.0 The King sits in Dumferlane toon\nA-drinkin' a... -1 0.000000 107 0.009463 106 9.309907e-03 84 9.178243e-03
1041 Skye Boat Song 3772 3.0 Speed bonnie boat, like a bird on the wing,\nO... -1 0.000000 107 0.008053 106 7.932637e-03 40 7.861972e-03
384 Christopher Columbo 4843 3.0 In fourteen hundred and ninety-two\nA man whos... -1 0.000000 107 0.008030 57 7.883588e-03 127 7.763614e-03

Visualisation¶

Now I want to view my embeddings and clusters in a 2D or 3D space. There are specialised tools for this which combine the steps of feature reduction and visualisation, but I have already calculated t-SNE in two dimensions so I can use this for plotting in 2D directly. Using a slightly more advanced plotter, plotly.express, will allow me to inspect the data points and clusters better.

Here we can observe that by using t-SNE we can see clusters that HDBSCAN could not. Let's try just feeding the t-SNE values of the vectors directly into the model.

Method 3: Clustering on T-SNE instead¶

Out[181]:
HDBSCAN(gen_min_span_tree=True, min_cluster_size=3, min_samples=2,
        prediction_data=True)
Out[242]:
level_0 key_name name version_in_key bi_file dt_file roud lyrics roud_count lyric_embed_instructor tsne pca cluster_label probability cluster_tsne
158 1056 Teasing Songs Teasing Songs A EM256 SWTVILT2 10232 and 10404 My father's a lavatory cleaner, He works all t... 3.0 [-0.024734592, 0.008803896, -0.017447433, 0.00... [-13.657355308532715, -14.710352897644043] [0.11169321089982986, -0.05945173650979996] -1 0.0 -1
468 3373 NaN Down the River(2) NaN R592 DOWNRIV2 7677 Oh! the river is up and the channel is deep,\n... 3.0 [-0.060044736, -0.007352145, -0.031162467, 0.0... [30.202899932861328, 32.370948791503906] [0.11235740780830383, 0.059009138494729996] -1 0.0 -1
819 6177 NaN Man of Constant Sorrow NaN CSW113 CONSTSOR 499 I am a man of constant sorrow\nI've seen troub... 3.0 [-0.06407344, -0.0005744899, -0.064674266, 0.0... [18.663698196411133, 0.9300674200057983] [0.11235469579696655, -0.009118037298321724] -1 0.0 -1
813 6082 NaN Lowlands of Holland 6 NaN R083 LOWHOLL6 484 My love has built a bonnie ship and set her on... 12.0 [-0.042404126, -0.022763843, -0.025479745, 0.0... [-6.164587020874023, 22.162565231323242] [-0.1421692967414856, 0.12485475838184357] -1 0.0 -1
805 6074 NaN Lowlands NaN C286 VANTYGL9 122 A boy he had an auger\nThat bored two holes at... 11.0 [-0.035781313, -0.0020581644, -0.038625307, 0.... [-5.4679741859436035, 31.982099533081055] [0.08425074070692062, 0.061814501881599426] -1 0.0 -1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
60 400 Green Gravel Green Gravel F R532 GRNGRAVL 1368 similar to the above, in which the King sends ... 7.0 [-0.049485464, 0.011227211, -0.017543187, 0.04... [-6.266772747039795, -3.24650239944458] [-0.08977943658828735, 0.019546043127775192] -1 0.0 229
624 4695 NaN The Holland Handkerchief NaN NaN SUFFMRC4 246 There was a lord lived in this town;\nHis prai... 4.0 [-0.041949566, -0.011385148, -0.028533552, 0.0... [-5.777307987213135, -5.01576566696167] [-0.1859288066625595, -0.037052709609270096] 128 1.0 229
1212 9506 NaN Lady Margaret NaN C077 WILIGHO2 50 1.Lady Margaret sitting in her own lone home,\... 3.0 [-0.0379538, 0.0055425367, -0.033770416, 0.040... [-5.130154609680176, -9.474664688110352] [-0.09428334981203079, -0.07807657867670059] -1 0.0 230
1217 9545 NaN Sweet William's Ghost NaN C077 WILIGHOS 50 There came a ghost to Margaret's door\nWith ma... 3.0 [-0.044762462, 0.0029134688, -0.043984536, 0.0... [-5.1066575050354, -9.418922424316406] [-0.11307947337627411, 0.013046366162598133] -1 0.0 230
870 6861 NaN Oh, Who Is at My Bedroom Window? NaN LM04 SILVDAG3 22620 and 22621 Oh, who is at my bedroom window,\nDisturbing m... 5.0 [-0.028516868, -0.018946797, -0.026683377, 0.0... [-4.7569098472595215, -8.129509925842285] [-0.1337743103504181, -0.07173246890306473] -1 0.0 230

1259 rows × 15 columns

Clustering on the dimension-reduced data finds more (230) and larger groups, and treats less data as noise. It also, however, increases potential false matches: 122 clusters now have mixed Roud numbers, compared to 24.

Out[185]:
230
Out[267]:
name roud
cluster_tsne
-1 168 96
219 15 10
172 24 8
131 9 7
212 11 7
... ... ...
109 5 2
108 4 2
105 3 2
103 4 2
230 3 2

122 rows × 2 columns

The T-SNE cluster 219 has the largest selection of Roud numbers. When we examine it we see a collection of Scotland-related songs. This was also visible in a mixed-Roud cluster in Method 1 which contained some Scots and Scottish dialect words (more than here), leading me to suspect that these are limitations in using these embeddings on some varieties of English. Note: the texts also contain archaic language but the model seems to handle this better.

Out[249]:
name lyrics roud roud_count cluster_tsne probability_tsne
485 Fair Flower of Northumberland The provost's aye daughter was making her lane... 25 3.0 219 1.000000
486 Fair Flower of Northumberland 2 It was a knight in Scotland born,\nFollow, my ... 25 3.0 219 1.000000
613 The Heiress of Northumberland "Why, fair maid, have pity on me,"\nWaly's my ... 25 3.0 219 1.000000
1213 The Lord of Scotland The Lord of Scotland, he is come home\nUnto hi... 47 5.0 219 1.000000
788 Lord Banner Four and twenty ladies,\nThey being at a ball,... 52 4.0 219 0.984013
85 Lochinvar Sir Walter Scott's adaption of the above. He s... 93 4.0 219 0.925318
725 Katherine Jaffry (Lochnagar) Lochnagar cam frae the west\nInto the low coun... 93 4.0 219 0.925318
84 Katharine Jaffray [Child 221] 1 There livd a lass in yonder dale, And doun i... 93 4.0 219 0.880717
599 The Grey Silkie of Sule Skerry In Norwa land, there lived a maid\nBaloo, my b... 197 3.0 219 0.879330
748 Lady Odivere (Grey Silkie 3) In Norowa a lady bade\nA bonny lass in muckle ... 197 3.0 219 0.879330
729 Kellyburnbraes There lived a carl in Kellyburnbraes,\nHey and... 160 7.0 219 0.725164
112 Mother, Mother, Make My Bed She called to her little page boy, Who was her... 45 4.0 219 0.725115
737 Knight and the Shepherd's Daughter 3 Earl Richard, once upon a day,\nAnd all his va... 67 5.0 219 0.722462
322 Bonnie Annie There was a rich merchant wha lived in Strathd... 172 3.0 219 0.702716
373 Captain Wedderburn's Courtship The Laird o' Roslin's daughter\nWalked through... 36 3.0 219 0.702716

Interestingly, the model is 100% certain that The Lord of Scotland belongs with the songs at Roud #25. Although officially it belongs to Roud #47, it was collected in the 1940s from George Edwards of Vermont, USA. R. Matteson remarks, "It's clear to me that some of his ballads are recreations from print sources. [...] It's hard to tell what is traditional."

Although the model could be bringing new insights, it most likely needs tuning first. Let's see if it's possible to find a configuration using either raw embeddings or T-SNE clustering that sorts the most data possible without over-capturing.

Tuning using Density Based Clustering Validation¶

RandomizedSearchCV is a handy tool for sampling combinations of the model's hyperparameters and testing them against an appropriate metric. Density Based Clustering Validation (DBCV) produces a score from -1 to 1, with a higher value indicating better average cluster densities and therefore probably a better clustering solution. However, the metric does not directly measure the fit of the data (given that there are no labels in clustering), and the implementation in HDBSCAN is an approximation of DBCV.

  • min_samples "The simplest intuition for what min_samples does is provide a measure of how conservative you want you clustering to be. The larger the value of min_samples you provide, the more conservative the clustering" (docs)
  • min_cluster_size can be at lowest 2, and the dataset already has data that should form valid clusters of 3, so I am limited to trying 2 or 3
  • metric was explored above and metrics are already stored in distance_metrics so I will test all of these.
  • cluster_selection_method has always been eom so far (Excess of Mass - stable clusters) but leaf might be suitable for this data due to the small clusters.
Best Parameters {'min_samples': 9, 'min_cluster_size': 2, 'metric': 'cityblock', 'cluster_selection_method': 'eom'}
DBCV score :1.9824646409581193e-06

Best Parameters for instructor embeddings:

  • 'min_samples': 9 (note: only tested 1-10; this took four minutes)
  • 'min_cluster_size': 2
  • 'metric': 'cityblock'
  • 'cluster_selection_method': 'eom'

DBCV score: 0.0000019825

Unfortunately, the testing was not successful, with the results depending more on the random state than on any particular configuration of hyperparameters:

'min_samples': range(1, 1000):

Best Parameters {'min_samples': 837, 'min_cluster_size': 2, 'metric': 'canberra', 'cluster_selection_method': 'leaf'}
DBCV score: 0.0027403886
Best Parameters {'min_samples': 789, 'min_cluster_size': 2, 'metric': 'l1', 'cluster_selection_method': 'leaf'}
DBCV score: 0.0000000000
Best Parameters {'min_samples': 291, 'min_cluster_size': 3, 'metric': 'euclidean', 'cluster_selection_method': 'eom'}
DBCV score: 0.0000000000
Best Parameters {'min_samples': 265, 'min_cluster_size': 2, 'metric': 'braycurtis', 'cluster_selection_method': 'leaf'}
DBCV score: 0.0012053879
Best Parameters {'min_samples': 83, 'min_cluster_size': 3, 'metric': 'cityblock', 'cluster_selection_method': 'eom'}
DBCV score: 0.0000000000
Best Parameters {'min_samples': 533, 'min_cluster_size': 3, 'metric': 'manhattan', 'cluster_selection_method': 'leaf'}
DBCV score: 0.0000000000
Best Parameters {'min_samples': 791, 'min_cluster_size': 3, 'metric': 'braycurtis', 'cluster_selection_method': 'leaf'}
DBCV score: 0.0000017915
Best Parameters {'min_samples': 769, 'min_cluster_size': 2, 'metric': 'braycurtis', 'cluster_selection_method': 'leaf'}
DBCV score: 0.0001857824
Best Parameters {'min_samples': 974, 'min_cluster_size': 3, 'metric': 'euclidean', 'cluster_selection_method': 'leaf'}
DBCV score: 0.0000000000
Best Parameters {'min_samples': 924, 'min_cluster_size': 2, 'metric': 'infinity', 'cluster_selection_method': 'leaf'}
DBCV score: 0.0012254705

'min_samples': range(1, 10):

Best Parameters {'min_samples': 6, 'min_cluster_size': 3, 'metric': 'l2', 'cluster_selection_method': 'leaf'}
DBCV score: 0.1190531391
Best Parameters {'min_samples': 4, 'min_cluster_size': 3, 'metric': 'euclidean', 'cluster_selection_method': 'eom'}
DBCV score: 0.1452398871
Best Parameters {'min_samples': 6, 'min_cluster_size': 2, 'metric': 'l1', 'cluster_selection_method': 'eom'}
DBCV score: 0.0707059985
Best Parameters {'min_samples': 4, 'min_cluster_size': 3, 'metric': 'manhattan', 'cluster_selection_method': 'eom'}
DBCV score: 0.1701129252
Best Parameters {'min_samples': 5, 'min_cluster_size': 3, 'metric': 'chebyshev', 'cluster_selection_method': 'eom'}
DBCV score: 0.1710791147
Best Parameters {'min_samples': 6, 'min_cluster_size': 2, 'metric': 'manhattan', 'cluster_selection_method': 'leaf'}
DBCV score: 0.0828771347
Best Parameters {'min_samples': 8, 'min_cluster_size': 2, 'metric': 'euclidean', 'cluster_selection_method': 'eom'}
DBCV score: 0.0034253212
Best Parameters {'min_samples': 6, 'min_cluster_size': 3, 'metric': 'l2', 'cluster_selection_method': 'leaf'}
DBCV score: 0.1190531391
Best Parameters {'min_samples': 9, 'min_cluster_size': 2, 'metric': 'p', 'cluster_selection_method': 'leaf'}
DBCV score: 0.0884996060
Best Parameters {'min_samples': 4, 'min_cluster_size': 3, 'metric': 'braycurtis', 'cluster_selection_method': 'eom'}
DBCV score: 0.1809713556
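
Part of the run-to-run variation may simply come from how a randomized search works: it scores a handful of randomly drawn configurations rather than the whole grid, so different random states see different candidates. Below is a rough sketch of that sampling step in plain Python, assuming the ten metric names seen in the output above and n_iter = 10 (the RandomizedSearchCV default; I have not confirmed the value used for these runs). The helper names are hypothetical, and the sketch mimics only the sampling, not the fitting or scoring.

```python
import itertools
import random

# Search space for the range(1, 10) test. The metric list is an assumption
# based on the names that appear in the search output.
param_dist = {
    "min_samples": list(range(1, 10)),
    "min_cluster_size": [2, 3],
    "metric": ["l1", "l2", "p", "infinity", "manhattan", "cityblock",
               "euclidean", "chebyshev", "braycurtis", "canberra"],
    "cluster_selection_method": ["eom", "leaf"],
}

def grid_size(space):
    """Number of distinct configurations in the full grid."""
    size = 1
    for values in space.values():
        size *= len(values)
    return size

def sample_configs(space, n_iter, seed):
    """Draw n_iter distinct configurations without replacement."""
    rng = random.Random(seed)
    keys = list(space)
    grid = list(itertools.product(*(space[k] for k in keys)))
    return [dict(zip(keys, combo)) for combo in rng.sample(grid, n_iter)]

total = grid_size(param_dist)
for seed in (0, 1, 2):
    configs = sample_configs(param_dist, n_iter=10, seed=seed)
    print(f"seed {seed}: tried {len(configs)} of {total}; first: {configs[0]}")
```

Under these assumptions the grid holds 360 configurations and each search visits 10 of them, so two runs with different seeds mostly compare different candidates. Raising n_iter or fixing random_state would make the runs more comparable.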

Note: the measure relative_validity_ (an approximation of the DBCV score) only works for comparing results across different choices of hyperparameters, so I cannot easily compare against the two previous models used. Technically I also cannot compare the two tests run on different data, but for want of a better metric in the time I have, I'll continue with the T-SNE data and the output with the highest score.

"Best Parameters" for T-SNE dimension-reduced embeddings:

  • 'min_samples': 4
  • 'min_cluster_size': 3
  • 'metric': 'braycurtis'
  • 'cluster_selection_method': 'eom'

DBCV score: 0.1809713556

Failed tuned model:¶

Out[326]:
HDBSCAN(gen_min_span_tree=True, metric='braycurtis', min_cluster_size=3,
        min_samples=4, prediction_data=True)

Prediction¶

Thankfully, I already have a model that can semi-reliably cluster similar Roud numbers, so I can try some predictions.

Out[358]:
|  | index | key_name | name | version_in_key | bi_file | dt_file | roud | lyrics | roud_count | lyric_embed_instructor | tsne | test_label | strength |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A Robin, Jolly Robin | A Robyn Jolly Robyn | A | Perc1185 | HEYROBIN | NaN | "[F]rom what appears to be the most ancient of... | NaN | [-0.032208655, -0.0039244993, -0.02159848, 0.0... | [31.15694808959961, -52.046730041503906] | 11 | 0.191501 |
| 1 | 1 | A Robin, Jolly Robin | (No Title) | B | Perc1185 | HEYROBIN | NaN | 71 'Hey, Robin, jolly Robin, 72 Tell me how... | NaN | [-0.032028995, 0.020379173, -0.016789645, 0.03... | [6.489591121673584, -72.48700714111328] | -1 | 0.000000 |
| 2 | 2 | A, U, Hinny Bird | A, U, Hinny Bird | A | StoR160 | NaN | 235 | A, U, hinny burd; The bonny lass o' Benwell, A... | 1.0 | [-0.025857605, 0.010645705, -0.02403562, 0.050... | [0.7492282390594482, -59.18363571166992] | 31 | 0.599214 |
| 3 | 3 | Adieu to Erin (The Emigrant) | Adieu to Erin | A | SWMS255 | NaN | 2068 | Oh, when I breathed a last adieu, To Erin's an... | 1.0 | [-0.043128256, 0.008317871, -0.040352777, 0.01... | [35.04929733276367, -3.3678462505340576] | 131 | 0.546298 |
| 4 | 4 | Agincourt Carol, The | The Song of Agincourt | A | MEL51 | AGINCRT1 | V29347 | Deo gracias anglia, Redde pro victoria, 1 Owre... | 3.0 | [-2.9962948e-05, -0.009317334, -0.017600924, 0... | [-28.1324462890625, -51.7108154296875] | -1 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9968 | 10108 | NaN | Zeb Tourney's Girl | NaN | LE18 | ZEBTURNY | 2249 | Down in the Tennessee mountains,\nFar from the... | 2.0 | [-0.0389951, -0.008086728, -0.058877233, 0.037... | [-47.99359893798828, 18.380474090576172] | 16 | 0.239531 |
| 9969 | 10109 | NaN | Zebra Dun | NaN | LB16 | ZEBRADUN | 3237 | We was camped on the plains at the head of the... | 1.0 | [-0.044695836, -0.012055198, -0.042596623, 0.0... | [-42.704925537109375, 19.89595603942871] | 18 | 0.317023 |
| 9970 | 10110 | NaN | Zen Gospel Singing | NaN | NaN | ZENGOSPE | NaN | I once was a Baptist and on each Sunday morn\n... | NaN | [-0.043211307, -0.0138436, -0.017630804, 0.003... | [24.264463424682617, 34.33056640625] | 78 | 0.351471 |
| 9971 | 10111 | NaN | Zuleika | NaN | NaN | ZULIKA | NaN | Zuleika was fair to see,\nA fair Persian maide... | NaN | [-0.0115272915, -0.029302498, -0.022729361, 0.... | [22.684234619140625, -7.372575759887695] | -1 | 0.000000 |
| 9972 | 10112 | NaN | The Zulu King | NaN | NaN | ZULUKING | NaN | Oh the Zulu king with the big nose-ring\nFell ... | NaN | [-0.043027755, -0.010383786, -0.020057475, 0.0... | [55.304954528808594, 21.080873489379883] | -1 | 0.000000 |

9973 rows × 13 columns
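
One possible way to turn the test_label and strength columns into Roud suggestions is a majority vote within each cluster: songs without a Roud number inherit the most common Roud number among their cluster-mates, while noise points (label -1) and low-strength assignments are left alone. This is only a sketch of that idea, not necessarily how the predictions are used in this notebook. The function name, the 0.2 strength threshold and the toy rows are all made up for illustration.

```python
from collections import Counter, defaultdict

def suggest_roud(rows, min_strength=0.2):
    """Suggest a Roud number for unindexed songs by cluster majority vote.

    rows: list of dicts with keys "name", "roud" (str or None),
    "test_label" (int, -1 means noise) and "strength" (float).
    """
    # Count the known Roud numbers inside each non-noise cluster.
    votes = defaultdict(Counter)
    for row in rows:
        if row["roud"] is not None and row["test_label"] != -1:
            votes[row["test_label"]][row["roud"]] += 1

    suggestions = {}
    for row in rows:
        label = row["test_label"]
        if row["roud"] is not None or label == -1:
            continue  # already indexed, or treated as noise
        if row["strength"] < min_strength or label not in votes:
            continue  # weak membership, or no indexed cluster-mates
        suggestions[row["name"]] = votes[label].most_common(1)[0][0]
    return suggestions

# Toy rows (invented for this example, not taken from the dataset).
toy_rows = [
    {"name": "Song A1", "roud": "100", "test_label": 7, "strength": 0.9},
    {"name": "Song A2", "roud": "100", "test_label": 7, "strength": 0.8},
    {"name": "Song A3", "roud": "200", "test_label": 7, "strength": 0.4},
    {"name": "Song B", "roud": None, "test_label": 7, "strength": 0.6},
    {"name": "Song C", "roud": None, "test_label": 7, "strength": 0.1},
    {"name": "Song D", "roud": None, "test_label": -1, "strength": 0.0},
]
print(suggest_roud(toy_rows))
```

In the toy data only Song B receives a suggestion: Song C falls below the strength threshold and Song D is noise. Clusters that mix several Roud numbers, like the one here, are also the places where an indexing error or a genuinely ambiguous song might be hiding.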

Future directions:¶

  • App for Roud prediction
  • Tool for detecting Roud indexing errors in song databases
  • Topic modelling: https://docs.cohere.com/page/topic-modeling
  • Network/d3 graphing: https://erdogant.github.io/hnet/pages/html/Plots.html#static-graph, https://blog.scottlogic.com/2020/05/01/rendering-one-million-points-with-d3.html, https://networkx.org/documentation/stable/tutorial.html#adding-attributes-to-graphs-nodes-and-edges
  • Generation of new 'traditional' songs based on a cluster